The Model Merge Phenomenon: Combining Capabilities Without Training
What if you could combine the strengths of two models without retraining? Create a model that writes code like Model A but reasons like Model B? This is model merging, and it works.
Model merging takes weights from two or more models and combines them in clever ways. The result is often surprising: emergent capabilities you wouldn’t expect from simple averaging.
How Model Merging Works
The simplest merge is linear interpolation. If Model A has weights W_A and Model B has weights W_B, the merged model has weights W = (1-a)*W_A + a*W_B for some interpolation coefficient a between 0 and 1.
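A minimal sketch of this formula, applied to a single toy weight matrix (a real merge would loop over every tensor in the two checkpoints):

```python
import numpy as np

def linear_merge(w_a, w_b, a):
    """Linear interpolation of two weight tensors: W = (1 - a) * W_A + a * W_B."""
    return (1.0 - a) * w_a + a * w_b

# Toy 2x2 weight matrices standing in for one layer of each model.
w_a = np.array([[1.0, 2.0], [3.0, 4.0]])
w_b = np.array([[5.0, 6.0], [7.0, 8.0]])
merged = linear_merge(w_a, w_b, 0.5)  # a = 0.5: the midpoint of the two models
```

At a = 0 you recover Model A exactly; at a = 1, Model B.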
On its own, this rarely works well: naive averaging can destroy the delicate weight structure each model learned. But with more careful techniques, merging works surprisingly well.
SLERP: Spherical Linear Interpolation
SLERP (Spherical Linear Interpolation) treats weight vectors as points on a sphere. Instead of straight-line interpolation, it moves along a geodesic through the weight space.
SLERP preserves the magnitude of weight vectors better than linear interpolation, which shrinks the result whenever the two vectors point in different directions. The result: merges that keep the model coherent.
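A minimal SLERP sketch over flattened weight vectors, assuming the standard formulation (interpolate the angle between the two vectors rather than the straight-line chord), with a fallback to linear interpolation when the vectors are nearly parallel:

```python
import numpy as np

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    a = w_a / (np.linalg.norm(w_a) + eps)
    b = w_b / (np.linalg.norm(w_b) + eps)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(dot)          # angle between the two weight directions
    if omega < eps:                 # nearly parallel: plain lerp is fine
        return (1.0 - t) * w_a + t * w_b
    so = np.sin(omega)
    # Weights follow the arc, so two unit vectors merge into a unit vector.
    return (np.sin((1.0 - t) * omega) / so) * w_a + (np.sin(t * omega) / so) * w_b
```

For two orthogonal unit vectors, the t = 0.5 merge still has norm 1, where linear interpolation would have shrunk it to about 0.71.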
TIES Merging
TIES (TrIm, Elect Sign & merge) is more sophisticated. It trims away each model's small, low-magnitude weight changes, resolves conflicts among the remaining changes by electing a dominant sign per parameter, and then averages only the changes that agree with that sign.
The TIES paper reports that merging multiple task-specific fine-tunes this way can outperform any individual model on benchmarks spanning those tasks — for example, combining a code-focused model with a reasoning-focused one for tasks requiring both skills.
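A hedged sketch of the three TIES steps over toy weight vectors. This is a simplified reading of the method, not the reference implementation; `density` (the fraction of each delta kept after trimming) is a parameter name chosen here for illustration:

```python
import numpy as np

def ties_merge(base, finetuned, density=0.5):
    """Simplified TIES: trim small deltas, elect signs, merge agreeing deltas."""
    deltas = [m - base for m in finetuned]

    # 1. Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.size))
        thresh = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))

    # 2. Elect sign: per parameter, the sign of the summed trimmed deltas.
    elected = np.sign(sum(trimmed))

    # 3. Merge: average only the deltas whose sign matches the elected one.
    stacked = np.stack(trimmed)
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = (stacked * agree).sum(axis=0) / counts

    return base + merged_delta
```

Note how a parameter that two models push in opposite directions simply cancels out instead of being averaged into a muddled compromise — that sign election is the heart of TIES.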
DARE Merging
DARE (Drop And REscale) randomly drops a large fraction of each fine-tuned model's weight deltas and rescales the survivors, instead of averaging everything. Counter-intuitively, dropping most of the deltas often barely hurts performance, and the resulting sparsity reduces interference when merging models fine-tuned on different datasets.
DARE is particularly good for combining multiple fine-tuned models (e.g., 5 different LoRA adapters) into a single coherent model.
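A minimal sketch of the drop-and-rescale step for one model's delta. The 1/(1 - drop_rate) rescaling keeps the expected delta unchanged, which is what lets the sparsified deltas from several fine-tunes be summed onto one base with less interference:

```python
import numpy as np

def dare(base, finetuned, drop_rate=0.9, rng=None):
    """Simplified DARE: randomly zero out delta parameters, rescale survivors."""
    rng = np.random.default_rng(rng)
    delta = finetuned - base
    mask = rng.random(delta.shape) >= drop_rate   # keep ~(1 - drop_rate) of deltas
    # Rescale so that E[kept delta] equals the original delta.
    return base + (delta * mask) / (1.0 - drop_rate)
```

With `drop_rate=0.0` this returns the fine-tuned model unchanged; at high drop rates it keeps only a sparse, amplified subset of the changes.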
Why Merging Works
The key insight is that fine-tuned models share the same base architecture and are trained from the same initialization. Their weight spaces are aligned in ways that allow meaningful interpolation.
When you merge models that diverged from the same starting point, you’re not combining arbitrary weight matrices. You’re blending carefully learned deviations from a common base.
Practical Use Cases
Combining specialized adapters: Train 5 LoRA adapters on different domains, merge them into a single multi-domain model.
Balancing trade-offs: One model is verbose but accurate. Another is concise but sometimes wrong. Merge them to balance both.
Rapid model development: Don’t have time to train? Merge two existing models and iterate from there.
Tools for Merging
mergekit is the de facto standard tool. It handles SLERP, TIES, DARE, and custom merge strategies. Using it is straightforward:
Define a YAML config specifying which models to merge and which method. Run mergekit. Get a merged model.
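A config for a simple linear merge might look like the following sketch (the model names are placeholders, and the exact schema may vary by mergekit version — check its README):

```yaml
# Linear merge of two hypothetical fine-tunes of the same base model.
models:
  - model: org/model-a      # placeholder model name
    parameters:
      weight: 0.5           # contribution of model A
  - model: org/model-b      # placeholder model name
    parameters:
      weight: 0.5
merge_method: linear
dtype: float16
```

Running `mergekit-yaml config.yml ./merged-model` then writes the merged checkpoint to the output directory.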
The process is fast (minutes, not hours) and requires no training.
The Limitation
Merging only works well when models are compatible: same architecture, similar capability levels, trained from the same initialization.
Merging a 7B and a 70B model can't work: the weight tensors don't even have matching shapes. Merging models from different architectures won't work either. But within compatible families, merging is powerful.
What This Means
Model merging democratizes the ability to create specialized models. You don’t need to train from scratch. Combine existing models, and you often get something better than any individual model.
This is particularly powerful in the era of open source models where dozens of fine-tuned variants exist for every task.
Laeka Research — laeka.org