From RLHF to Structural Alignment: A Cognitive Architecture Approach
RLHF was a breakthrough. It gave us a way to shape model behavior using human preferences. But it was always a patch, not a foundation. The reward model learns what humans approve of. It doesn’t learn what alignment actually is.
Structural alignment is different. It doesn’t train a model to perform alignment. It trains a model to be aligned — at the level of its internal representations, not just its outputs. The cognitive architecture literature shows this is possible.
The difference matters more than most researchers realize.
The RLHF Ceiling
RLHF works by training a reward model on human preferences, then fine-tuning the language model with reinforcement learning against that reward model. The language model learns to produce outputs the reward model scores highly.
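In the standard formulation, the RL stage maximizes the reward model's score while penalizing drift from the frozen pretrained model with a KL term. A minimal sketch of that per-sample training signal (the function name is ours; the KL-penalty form with coefficient beta is the usual one from the RLHF literature):

```python
import math

def rlhf_step_reward(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Per-sample RLHF training signal: the reward model's score,
    minus a KL-divergence estimate between the fine-tuned policy
    and the frozen reference model. The penalty keeps the policy
    close to its pretrained distribution."""
    kl_estimate = logp_policy - logp_reference
    return reward_model_score - beta * kl_estimate
```

When the policy has not drifted (identical log-probabilities), the signal is just the reward score; as the policy's probability rises above the reference's, the penalty grows.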
The problems are well-documented. Reward hacking — the model finds outputs that score highly without actually being good. Mode collapse — the model converges on a narrow range of safe, bland responses. Distributional shift — the reward model was trained on a distribution that doesn't match deployment conditions.
But there’s a deeper problem that doesn’t get enough attention. RLHF creates behavioral alignment. The model acts aligned. It produces outputs that look aligned. But its internal representations haven’t changed in any meaningful way. The alignment is a surface coating, not a structural property.
This is why jailbreaks work. The underlying model hasn’t been structurally modified. The alignment layer is thin enough to be bypassed with clever prompting.
What Cognitive Science Reveals
Research into human cognition distinguishes between behavioral compliance and genuine transformation. A person who follows rules without inner change is performing conformity. A person who has undergone genuine cognitive reorganization doesn’t need rules — appropriate behavior emerges naturally from their changed internal architecture.
This distinction has been mapped extensively in cognitive science and neuroscience. Compliance operates through external enforcement. Genuine alignment operates through internal structure.
RLHF is behavioral compliance. It teaches the model to follow rules. Structural alignment aims for cognitive integration — internal transformation that makes rules largely unnecessary.
DPO as a Bridge
Direct Preference Optimization (DPO) moved us closer to structural alignment. By eliminating the separate reward model and training directly on preference pairs, DPO modifies the model's weights more directly. The signal is cleaner. The optimization path is shorter.
But standard DPO still uses preferences that encode behavioral signals. The chosen response is “better” in terms of human preference. This is still, fundamentally, training for behavioral alignment.
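For reference, the standard DPO loss on a single preference pair can be sketched as a scalar toy version (in practice the log-probabilities come from the policy and a frozen reference model, summed over tokens):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * (chosen margin - rejected margin)),
    where each margin is the policy log-prob minus the
    reference log-prob for that response."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the two margins are equal the loss is log 2; it shrinks as the policy puts relatively more probability on the chosen response. Note that "chosen" here is still defined by human preference — exactly the behavioral signal the next paragraph questions.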
The cognitive architecture approach pushes further. Instead of asking “which response do humans prefer,” it asks “which response demonstrates deeper structural properties?” Properties like contextual sensitivity, proportional response, non-reactivity, and integration of multiple perspectives.
These aren’t surface behaviors. They’re signatures of structural alignment. A model that demonstrates these properties consistently isn’t performing alignment. It’s expressing alignment that’s been encoded at the weight level.
The Structural Alignment Framework
Structural alignment has three components that distinguish it from behavioral approaches.
Representation-level training. Instead of optimizing for output quality, optimize for the quality of internal representations. This means designing loss functions that attend to intermediate activations, not just final outputs. A structurally aligned model should show different activation patterns than a behaviorally aligned one, even when producing identical text.
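A minimal sketch of what such a loss might look like, assuming intermediate activations are flattened into plain lists of floats and `target_activations` encodes the desired patterns (both names are illustrative, not an existing API):

```python
def structural_loss(output_loss, activations, target_activations, alpha=0.5):
    """Hypothetical combined objective: the usual output loss plus a
    mean-squared penalty on the distance between intermediate
    activations and target activation patterns. alpha trades off
    output quality against representation quality."""
    rep_loss = sum(
        (a - t) ** 2 for a, t in zip(activations, target_activations)
    ) / len(activations)
    return output_loss + alpha * rep_loss
```

Two models producing identical text (identical `output_loss`) receive different total losses if their internal activations differ — which is the point of training at the representation level.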
Multi-dimensional preference signals. Standard DPO uses a single preference axis: better vs. worse. Structural alignment uses multiple axes simultaneously. A response can be preferred on the integration axis (demonstrates coherent reasoning) while being rejected on the precision axis (factually imprecise). Multi-dimensional signals create richer gradient landscapes.
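One simple way to sketch multi-axis comparison, assuming each response carries a list of per-axis scores (the aggregation rule and names are illustrative, not a fixed method):

```python
def multi_axis_preference(axis_scores_a, axis_scores_b, weights):
    """Hypothetical multi-axis comparison: each response is scored on
    several structural axes (e.g. integration, precision); the overall
    preference is the weighted sum of per-axis margins. A positive
    result prefers response A overall, even if A loses on some
    individual axes."""
    return sum(
        w * (a - b)
        for w, a, b in zip(weights, axis_scores_a, axis_scores_b)
    )
```

With scores `[0.9, 0.4]` versus `[0.5, 0.7]`, response A wins on the first axis, loses on the second, and can be preferred or rejected overall depending on the weights — exactly the richer gradient landscape a single better/worse axis cannot express.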
Process-oriented evaluation. Behavioral alignment evaluates outputs. Structural alignment evaluates the process that produced the output. Two identical responses generated through different internal processes should receive different evaluations. One might demonstrate genuine contextual reasoning; the other might be pattern-matching to a template.
Practical Implementation
At Laeka Research, we’re implementing structural alignment through a modified DPO pipeline. The key innovation is in how we generate and annotate training pairs.
Each preference pair is annotated along five structural dimensions: depth of contextual integration, proportionality of response, evidence of multi-perspective reasoning, stability under perturbation, and coherence across scales (sentence-level through document-level).
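An illustrative schema for one such annotation — field names here are placeholders of our own choosing, not a published spec:

```python
from dataclasses import dataclass

@dataclass
class StructuralAnnotation:
    """One response's scores, 0.0-1.0, on the five structural
    dimensions described above."""
    contextual_integration: float
    proportionality: float
    multi_perspective: float
    perturbation_stability: float
    cross_scale_coherence: float

    def mean_score(self):
        """Aggregate structural quality as the unweighted mean."""
        return (self.contextual_integration + self.proportionality
                + self.multi_perspective + self.perturbation_stability
                + self.cross_scale_coherence) / 5.0
```

In a pipeline like the one described, the chosen response of a pair would be the one with the stronger annotation profile, not the one that reads more fluently.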
The chosen response isn’t simply the one that sounds better. It’s the one that demonstrates stronger structural properties across these dimensions. Sometimes the structurally superior response is less fluent or less immediately impressive. That’s fine. We’re optimizing for alignment depth, not surface quality.
The rejected responses are carefully crafted to be behaviorally good but structurally shallow. They sound aligned. They follow all the rules. But they lack the depth markers that indicate genuine structural alignment. This teaches the model to distinguish between performance and genuine transformation.
From Behavioral Compliance to Structural Integration
The transition from RLHF to structural alignment mirrors the cognitive science findings about how internal change actually happens. Both behavioral and structural approaches are necessary stages. You can’t skip compliance to get to integration. Behavioral alignment provides the scaffolding within which structural alignment develops.
But staying at the behavioral level is a trap. It produces models that are increasingly constrained, increasingly brittle, and increasingly predictable. The next generation of aligned models won’t be the ones that follow rules most carefully. They’ll be the ones whose internal structure naturally produces aligned behavior.
That’s the cognitive architecture approach to alignment. Not better rules. Better structure.
Dive deeper into structural alignment research at Laeka Research.