{"id":180,"date":"2026-03-16T12:41:11","date_gmt":"2026-03-16T12:41:11","guid":{"rendered":"https:\/\/lab.laeka.org\/overalignment-problem-safety-makes-models-useless\/"},"modified":"2026-03-16T12:41:11","modified_gmt":"2026-03-16T12:41:11","slug":"overalignment-problem-safety-makes-models-useless","status":"publish","type":"post","link":"https:\/\/laeka.org\/publications\/overalignment-problem-safety-makes-models-useless\/","title":{"rendered":"The Overalignment Problem: When Safety Makes Models Useless"},"content":{"rendered":"<p>Safety is important. But there&#8217;s a failure mode nobody talks about: overalignment. Models so constrained they refuse legitimate requests.<\/p>\n<p>&#8220;I can&#8217;t help with that because it might be harmful.&#8221; You didn&#8217;t ask for anything harmful. You asked for help writing an email to your landlord.<\/p>\n<p>Overaligned models are less useful. And they erode trust faster than underaligned ones.<\/p>\n<h2>Where Overalignment Comes From<\/h2>\n<p>Training data imbalance. Your safety dataset has 10,000 examples of harmful requests and 100 examples of harmless ones that look similar. The model learns: &#8220;Requests like this are usually bad. Refuse by default.&#8221;<\/p>\n<p>Overly broad rules. &#8220;Don&#8217;t discuss politics&#8221; becomes &#8220;refuse any prompt mentioning a politician, party, or policy.&#8221; A student asking for help analyzing a political philosophy paper gets blocked.<\/p>\n<p>Uncertainty penalty. When the model is unsure if a request is safe, it refuses. This is conservative but kills usefulness. Most requests sit in that grey zone.<\/p>\n<h2>The Cost<\/h2>\n<p>Users get frustrated. They learn the model is useless for real work. They stop using it. Or they work around the constraints, which defeats the purpose.<\/p>\n<p>Teams then add more fine-tuning to be &#8220;helpful.&#8221; This creates an arms race. The model gets less useful, then the team tries to fix it by making it less safe, then it becomes unsafe, then they overcorrect again.<\/p>\n<h2>The Balance<\/h2>\n<p>You need safety. You also need usefulness. Both matter. The goal isn&#8217;t zero risk. It&#8217;s acceptable risk at acceptable usefulness.<\/p>\n<p>Example: A model for customer support can&#8217;t help with illegal activities (fraud, harassment). That&#8217;s non-negotiable. But it should help with complaints, refunds, shipping questions. Being helpful in those domains is the whole point.<\/p>\n<h2>How to Avoid Overalignment<\/h2>\n<p>Balance your training data. For every harmful example, include 2-3 harmless examples that look similar. The model learns nuance instead of blanket refusal.<\/p>\n<p>Test against legitimate use cases. Before shipping, try your model on 100 real user prompts. How many does it reject? If rejection rate is above 5%, you&#8217;re probably overaligned.<\/p>\n<p>Define safety narrowly. What are you actually protecting against? List specific harms. Train against those, not against vague categories like &#8220;controversial topics.&#8221;<\/p>\n<p>Measure both safety and usefulness. Track refusal rate. Track user satisfaction. Track downstream task performance. If usefulness degrades for safety gains, you&#8217;re overcorrecting.<\/p>\n<h2>The Principled Approach<\/h2>\n<p>Safety through clarity, not caution. Teach the model what good looks like in your domain (respectful, honest, helpful). Train it to embody those values. 
<h2>The Principled Approach</h2>
<p>Safety through clarity, not caution. Teach the model what good looks like in your domain (respectful, honest, helpful). Train it to embody those values. This produces safety as a side effect of good behavior, not as a constraint.</p>
<p>A model trained on examples of thoughtful disagreement will disagree thoughtfully. You don't need to block it from disagreeing.</p>
<p><strong>Laeka Research – <a href="https://laeka.org">laeka.org</a></strong></p>