Why do AI developers use both fine-tuning and RLHF/RLAIF?
Like, I get that you need to do self-supervised pre-training, because that’s a cheap way of giving the model a lot of data to learn basic grammar/reasoning/etc. And fine-tuning and RLHF/RLAIF are both more expensive ways to get data.
But then why do you do both fine-tuning and RLHF? They seem like they’re trying to solve something like the same problem - giving higher-quality data on the exact sort of output we want.
And why do you do them in that order - fine-tuning and then RLHF?
My current hypothesis is that the returns from fine-tuning are initially steeper, but level off more quickly. So you want to race up the steep part of that curve, and then jump onto the RLHF curve (which starts off more shallow, but has less steeply diminishing returns).
But I’m not sure exactly why that would be the case; again, my hypothesis would be:
Fine-tuning
Gets good returns while you have good fine-tuning data: since fine-tuning uses the same next-token-prediction objective as pre-training, it’s a rich signal - you’re getting feedback on every token of the text (as opposed to RLHF, where I think you just get a single scalar feedback score for a whole block of text). So there’s just more signal to update on per example, and the returns are quite steep. (I’ve tried to make this concrete in the first toy sketch after this list.)
But you quickly run out of high-quality data: fine-tuning requires a bunch of crowdworkers to write good demonstration text for the model to be trained on, and it’s quite costly and complicated to generate a lot of this data at a high quality level.
RLHF kind of bootstraps itself up a bit (the model generates some examples, you get feedback on how good they are, it generates more examples that are a bit better, you get feedback on those, and the loop keeps going - see the second sketch after this list). And so it doesn’t have such steeply diminishing returns.
And I guess empirically, it’s cheaper to get data from humans for RLHF than for fine-tuning. Like it’s easier to get useful data via humans giving thumbs-up/thumbs-down judgments (or comparisons between candidate outputs) to train the reward model than it is for them to write really high-quality text themselves that then gets fed into the fine-tuning stage. (Third sketch below.)
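To make the “per-token signal vs single scalar” point concrete for myself, here’s a toy PyTorch sketch. Everything in it (the shapes, the random data, the REINFORCE-style objective) is made up for illustration and isn’t how any lab actually implements either stage - the point is just how many feedback terms each example contributes.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)  # stand-in for model outputs

# Fine-tuning: a human-written demonstration gives one target per position, so the loss
# averages over batch * seq_len cross-entropy terms - a dense, per-token signal.
demo_tokens = torch.randint(0, vocab_size, (batch, seq_len))
sft_loss = F.cross_entropy(logits.reshape(-1, vocab_size), demo_tokens.reshape(-1))

# RLHF-style update: the model samples its own completion and gets a single scalar reward
# for the whole thing (faked with randn here), which just scales the sequence log-prob
# in a REINFORCE-style objective - one number of feedback per example.
dist = torch.distributions.Categorical(logits=logits)
sampled_tokens = dist.sample()                             # shape (batch, seq_len)
seq_logprob = dist.log_prob(sampled_tokens).sum(dim=-1)    # one log-prob per sequence
reward = torch.randn(batch)                                # one scalar judgment per sequence
rl_loss = -(reward * seq_logprob).mean()
```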
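And a similarly toy sketch of the bootstrapping loop: a tiny “policy” over ten possible outputs, with a fixed preference vector standing in for human feedback. As the policy improves, the samples it puts forward for feedback get better, which is the effect I’m gesturing at above. (Real RLHF trains a learned reward model and typically uses PPO or similar rather than plain REINFORCE - this is just the shape of the loop.)

```python
import torch

logits = torch.zeros(10, requires_grad=True)       # toy "policy" over 10 possible outputs
opt = torch.optim.Adam([logits], lr=0.1)
human_preference = torch.linspace(0.0, 1.0, 10)    # pretend output 9 is the one humans like best

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample((32,))                   # the model generates candidate outputs
    rewards = human_preference[samples]            # feedback on how good each one is
    # REINFORCE: raise the log-prob of samples in proportion to their reward, so the next
    # batch of samples (and hence the next batch of feedback) is a bit better than the last.
    loss = -(rewards * dist.log_prob(samples)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=-1))               # most of the probability mass should end up on output 9
```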
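On the thumbs-up/thumbs-down point: my understanding is that in practice the human data is often pairwise comparisons (“completion A is better than completion B”) rather than literal thumbs, and the reward model is trained with a Bradley-Terry-style loss on those comparisons. A minimal sketch, with a linear layer standing in for “LLM backbone plus scalar reward head” and random vectors standing in for the two completions:

```python
import torch
import torch.nn.functional as F

reward_model = torch.nn.Linear(16, 1)      # stand-in for "LLM backbone + scalar reward head"
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each row pair is one human comparison: the "chosen" completion was preferred to the "rejected" one.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry objective: push the preferred completion's score above the rejected one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Either way, the asymmetry is that each judgment costs the human a few seconds, whereas each fine-tuning example costs them the time to write a whole high-quality demonstration.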
Gemini and GPT-4 both seem to think that this hypothesis is basically right, but who knows.