From RLHF to Structural Alignment: A Cognitive Architecture Approach
Reinforcement learning from human feedback (RLHF) was a breakthrough. It gave us a way to shape model behavior using human preferences. But it was always a patch, not a foundation. The reward model learns what humans approve of. It…