From RLHF to Structural Alignment: A Cognitive Architecture Approach
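RLHF aligns model outputs with human preferences by optimizing against a reward signal derived from human feedback. But preference alignment is surface-level optimization: it shapes what the model outputs, not how the model is organized internally. What we need is architecture-level alignment: systems whose internal structure naturally produces aligned behavior without external reward signals….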