- I think reward hacking is an outer alignment problem: you’ve specified the goal wrongly, and the model ends up optimizing for the thing you wrote down rather than the thing you actually care about.
- And goal misgeneralization is an inner alignment problem: you might have correctly specified the goal (let's say the goal is to move a red block to the top left corner of the screen), but you train it on a dataset where, whenever the red block is in the top left, there also happens to be a blue square in the image, so it gets high reward whenever (e.g.) there's a blue square. And so it learns to seek blue squares. But then you deploy it in some other situation where that correlation breaks, and it keeps seeking blue squares rather than red-blocks-in-top-left-corners (see the toy sketch after this list).
- And this is different from the model just being dumb, in that the model is actually successfully optimizing with a good model of the environment (it understands how it could get the red block into the top left), but it now cares about blue squares instead, because you've carved blue-square-liking grooves into its head for a really long time.
- Again, Gemini/GPT-4 like this answer, but 🤷🏼
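- To make the blue-square story concrete, here's a minimal toy sketch (my own illustration, with made-up feature names, not anything from the setup above): during training, "blue square present" predicts reward exactly as well as "red block in top left," so a learner picking whichever feature best predicts reward can't tell them apart from the data alone; if it latches onto the blue square, its reward collapses once deployment breaks the correlation.

  ```python
  import random

  def true_reward(obs):
      # The goal we actually specified: reward iff the red block is in the top-left corner.
      return 1.0 if obs["red_block_top_left"] else 0.0

  def make_obs(correlated):
      red_tl = random.random() < 0.5
      # In training, a blue square always co-occurs with the red block; at deployment it doesn't.
      blue = red_tl if correlated else (random.random() < 0.5)
      return {"red_block_top_left": red_tl, "blue_square": blue}

  def fit_proxy(dataset):
      # Toy "learner": pick the single feature that best predicts reward on the training set.
      feats = ["red_block_top_left", "blue_square"]
      def accuracy(f):
          return sum(obs[f] == bool(true_reward(obs)) for obs in dataset) / len(dataset)
      # On the training set both features score 1.0, so the data alone can't distinguish the
      # intended goal from the proxy; break the tie toward the proxy to show what happens
      # if the learner happens to latch onto it.
      return max(feats, key=lambda f: (accuracy(f), f == "blue_square"))

  random.seed(0)
  train = [make_obs(correlated=True) for _ in range(1000)]
  proxy = fit_proxy(train)  # -> "blue_square"

  deploy = [make_obs(correlated=False) for _ in range(1000)]
  fires = [o for o in deploy if o[proxy]]
  hit_rate = sum(true_reward(o) for o in fires) / max(1, len(fires))
  print(proxy, round(hit_rate, 2))  # roughly 0.5: chance performance on the real goal
  ```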