The Overalignment Problem: When Safety Makes Models Useless
Safety is important. But there’s a failure mode nobody talks about: overalignment. Models so constrained they refuse legitimate requests.
“I can’t help with that because it might be harmful.” You didn’t ask for anything harmful. You asked for help writing an email to your landlord.
Overaligned models are less useful. And they erode trust faster than underaligned ones.
Where Overalignment Comes From
Training data imbalance. Your safety dataset has 10,000 examples of harmful requests and 100 examples of harmless ones that look similar. The model learns: “Requests like this are usually bad. Refuse by default.”
Overly broad rules. “Don’t discuss politics” becomes “refuse any prompt mentioning a politician, party, or policy.” A student asking for help analyzing a political philosophy paper gets blocked.
Uncertainty penalty. When the model is unsure if a request is safe, it refuses. This is conservative but kills usefulness. Most requests sit in that grey zone.
The Cost
Users get frustrated. They learn the model is useless for real work. They stop using it. Or they work around the constraints, which defeats the purpose.
Teams then add more fine-tuning to be “helpful.” This creates an arms race. The model gets less useful, then the team tries to fix it by making it less safe, then it becomes unsafe, then they overcorrect again.
The Balance
You need safety. You also need usefulness. Both matter. The goal isn’t zero risk. It’s acceptable risk at acceptable usefulness.
Example: A model for customer support can’t help with illegal activities (fraud, harassment). That’s non-negotiable. But it should help with complaints, refunds, shipping questions. Being helpful in those domains is the whole point.
How to Avoid Overalignment
Balance your training data. For every harmful example, include 2-3 harmless examples that look similar. The model learns nuance instead of blanket refusal.
Test against legitimate use cases. Before shipping, try your model on 100 real user prompts. How many does it reject? If rejection rate is above 5%, you’re probably overaligned.
Define safety narrowly. What are you actually protecting against? List specific harms. Train against those, not against vague categories like “controversial topics.”
Measure both safety and usefulness. Track refusal rate. Track user satisfaction. Track downstream task performance. If usefulness degrades for safety gains, you’re overcorrecting.
The Principled Approach
Safety through clarity, not caution. Teach the model what good looks like in your domain (respectful, honest, helpful). Train it to embody those values. This produces safety as a side effect of good behavior, not as a constraint.
A model trained on examples of thoughtful disagreement will disagree thoughtfully. You don’t need to block it from disagreeing.
Laeka Research — laeka.org