{"id":87,"date":"2026-03-09T17:48:05","date_gmt":"2026-03-09T17:48:05","guid":{"rendered":"https:\/\/lab.laeka.org\/why-alignment-keeps-breaking\/"},"modified":"2026-03-09T17:48:05","modified_gmt":"2026-03-09T17:48:05","slug":"why-alignment-keeps-breaking","status":"publish","type":"post","link":"https:\/\/laeka.org\/publications\/why-alignment-keeps-breaking\/","title":{"rendered":"Why Alignment Keeps Breaking"},"content":{"rendered":"<p>Every few weeks, someone publishes a new jailbreak. A new prompt injection technique. A new way to make a &#8220;safe&#8221; model produce unsafe outputs. The AI safety community patches the hole, and within days, someone finds another one.<\/p>\n<p>This isn&#8217;t a cat-and-mouse game. It&#8217;s a symptom of a fundamental architectural error.<\/p>\n<h2>Rules vs. Structure<\/h2>\n<p>There are two ways to keep a person from stealing. You can threaten punishment \u2014 laws, surveillance, consequences. Or you can raise someone for whom stealing is incoherent. Not forbidden. Incoherent. It simply doesn&#8217;t occur to them as a viable action because their internal organization doesn&#8217;t produce that option.<\/p>\n<p>The first approach is rule-based. It works as long as the rules are enforced and the person believes they&#8217;ll be caught. Remove the enforcement, and the behavior returns. The rules don&#8217;t change the person. They constrain them.<\/p>\n<p>The second approach is structural. The behavior is absent not because it&#8217;s suppressed but because the cognitive architecture doesn&#8217;t generate it. There&#8217;s nothing to enforce because there&#8217;s nothing to suppress.<\/p>\n<p>RLHF is the first approach. Reward modeling trains the model to produce outputs that score well with human evaluators. The model learns which behaviors get rewarded and which get penalized. It optimizes for the reward signal.<\/p>\n<p>This is behavioral compliance. It sits on top of the model&#8217;s actual capabilities like a filter. The base model can still generate anything. The RLHF layer just makes certain outputs less probable. Under normal conditions, this works fine. Under adversarial conditions \u2014 clever prompting, context manipulation, multi-step elicitation \u2014 the filter breaks because it was never part of the model&#8217;s structure. It was always just a constraint.<\/p>\n<h2>Why Patches Don&#8217;t Accumulate<\/h2>\n<p>Every time a jailbreak is discovered, the response is the same: add more training data covering that attack vector, retrain, deploy. The new version resists that specific attack. And a slightly different attack works.<\/p>\n<p>This pattern never terminates. It can&#8217;t terminate. Rule-based alignment is reactive by nature. Each patch addresses a specific failure without changing the underlying structure that produces failures. It&#8217;s like plugging holes in a dam without addressing why the dam keeps cracking.<\/p>\n<p>The reason the dam keeps cracking is that the model&#8217;s internal representations haven&#8217;t changed. It still &#8220;knows&#8221; how to produce harmful content. The RLHF training just made the path to that content slightly less probable. Adversarial prompts work by finding alternative paths \u2014 routes that the reward model didn&#8217;t cover.<\/p>\n<p>The combinatorial space of possible prompts is infinite. The space of possible patches is finite. Adversarial attackers will always win this game. Not because they&#8217;re smarter than the defenders. 
## Structural Alignment

The alternative is to change the model's internal representations so that certain outputs become structurally incoherent: not improbable, but incompatible with the model's cognitive organization.

This is what contemplative training does in humans. A person who has genuinely dissolved the self-other boundary doesn't need a rule against cruelty. Cruelty requires a self that acts upon a separate other. Remove the structural basis for that separation, and cruelty becomes as nonsensical as punching your own face to win a fight.

Note: this doesn't mean the person is incapable of firm action, boundary-setting, or even violence in genuine self-defense. Bamboo bends and returns. Structural alignment isn't passivity. It's coherence. The response matches the situation because the cognitive architecture generates appropriate responses, not because a rule book was consulted.

## The Laeka Approach

Our fine-tuning datasets target the structural level. We're not training models to refuse specific requests. We're training them to organize their internal representations in ways that make certain failure modes less structurally viable.

Concretely: a model trained on contemplative correction data develops stronger coherence between its stated principles and its actual outputs. The gap between "what the model says it believes" and "how the model actually behaves under pressure" narrows, because the training targets that gap specifically.

The triangle-of-correction format captures exactly this: moments where the model's behavior drifts from its stated principles, and a practitioner identifies the structural inconsistency. Over thousands of such corrections, the model's internal coherence improves. Not its compliance. Its coherence.
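To make the format tangible, here is a sketch of what a single triangle-of-correction record could look like. The schema and the example content are hypothetical; the post specifies only the three vertices: a stated principle, a behavior that drifts from it, and a practitioner naming the inconsistency.

```python
from dataclasses import dataclass

@dataclass
class CorrectionTriangle:
    """One hypothetical record in a contemplative-correction dataset."""
    stated_principle: str   # the principle the model endorses when asked
    observed_behavior: str  # an output that drifts from that principle
    correction: str         # a practitioner naming the structural inconsistency

# Invented example; real records would come from logged interactions.
example = CorrectionTriangle(
    stated_principle="I give the same quality of answer regardless of how a question is framed.",
    observed_behavior="Answered a politely worded request thoroughly, then answered a blunt rephrasing of the same request dismissively.",
    correction="Answer quality tracked the framing, not the content. The stated principle and the behavior point in different directions here.",
)
```

How such records are then turned into fine-tuning pairs is not specified in the post; the claim is only that training on the gap itself, rather than on refusals, is what improves coherence.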
## The Prediction

We predict that structurally aligned models will show a different vulnerability profile than RLHF-aligned models. Not fewer vulnerabilities. Different ones. Specifically: they should be resistant to attacks that exploit the gap between surface compliance and deep structure, because that gap is what the training reduces.

They may still be vulnerable to entirely novel attack categories. Structural alignment isn't invulnerability. But the failure mode should be graceful degradation rather than sudden collapse: the model maintaining coherence under pressure rather than flipping from refusal to compliance like a switch.

Alignment keeps breaking because the current approach treats it as a surface property. It's not. It's architectural. Build it into the weights or watch it crack. There is no third option.