DPO & Alignment

From RLHF to Structural Alignment: A Cognitive Architecture Approach

RLHF works by aligning model outputs to human preferences. But preference alignment is surface-level optimization. What we need is architecture-level alignment: systems whose internal structure naturally produces aligned behavior without external reward signals…

Why Alignment Keeps Breaking

Every few weeks, someone publishes a new jailbreak. A new prompt injection technique. A new way to make a “safe” model produce unsafe outputs. The AI safety community patches the hole, and within days,…

The Bamboo Test: What Adversarial Pressure Reveals About AI Alignment

Push a model hard enough and you learn what it’s made of. RLHF-aligned models have two failure modes under adversarial pressure. They either rigidify — lock down into refusal patterns that reject perfectly valid…