LoRA Explained: Fine-Tuning Billion Parameter Models on Your Laptop
Fine-tuning a billion-parameter model typically requires modifying billions of weights. That’s prohibitively expensive. LoRA (Low-Rank Adaptation) sidesteps this by updating only a tiny fraction of the model while achieving comparable results.
The insight is elegant: the weight updates learned during fine-tuning have low intrinsic rank. You don't need to learn a full-rank update to the weight matrix; a low-rank approximation of the update captures almost all of it.
How LoRA Works
Instead of fine-tuning the full weight matrix W, LoRA decomposes the update as a product of two smaller matrices: ΔW = B × A.
For a weight matrix of shape (d_out × d_in), LoRA introduces:
A: shape (r × d_in), where r is the rank (typically 8-64)
B: shape (d_out × r)
During the forward pass: output = W × input + (α / r) × (B × A) × input, where the scale α / r (set via lora_alpha) controls how strongly the low-rank update contributes.
You only train A and B, freezing the original W. The rank r is typically much smaller than d_in and d_out, so the trainable parameter count shrinks dramatically.
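The mechanism above can be sketched in a few lines of NumPy. This is illustrative, not any particular library's implementation; the initialization follows the standard LoRA scheme (A random, B zero, so the update starts at zero):

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init: B @ A = 0 at start

def lora_forward(x, alpha=16):
    # base path plus low-rank update, scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)

# with B initialized to zero, the adapted layer starts out
# identical to the frozen base layer
assert np.allclose(y, W @ x)
```

Note the shapes: W holds d_out × d_in = 1,048,576 values, while A and B together hold only r × (d_in + d_out) = 16,384, and only the latter receive gradients.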
The Numbers
For a 70B model with hidden dimension d ≈ 4096:
Full fine-tuning: 70B trainable parameters
LoRA (rank 8): 70B × (2 × 8 / 4096) ≈ 270M trainable parameters
LoRA (rank 64): 70B × (2 × 64 / 4096) ≈ 2.2B trainable parameters
The factor of 2 is because both A and B contribute: r × (d_in + d_out) parameters per adapted matrix, a fraction of roughly 2r/d for a square matrix. These figures assume every weight matrix gets an adapter; in practice you target only a few projections, so real counts are far smaller still. With rank-8 LoRA you're training well under half a percent of the model. The memory and compute savings are massive.
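The arithmetic is easy to check. Counting both A and B, a rank-r adapter on a square d × d matrix holds r × (d + d) parameters, a fraction 2r/d of the original:

```python
d = 4096      # hidden dimension
full = d * d  # parameters in one d x d weight matrix

for r in (8, 64):
    lora = r * (d + d)  # A is (r, d), B is (d, r)
    frac = lora / full  # equals 2r / d
    print(f"rank {r}: {lora:,} adapter params per matrix, {frac:.2%} of full")
    # scaled to a 70B model, assuming every weight matrix is adapted:
    print(f"  ~{70e9 * frac / 1e6:.0f}M trainable parameters")
```

Running this gives about 0.39% (≈270M of 70B) at rank 8 and about 3% (≈2.2B) at rank 64.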
Rank: The Key Tradeoff
LoRA’s rank is the tuning knob. Higher rank = more expressiveness but more parameters.
Rank 8: Very cheap, fast training. Works for minor domain adaptation, such as instruction formatting or a specific style.
Rank 16-32: Sweet spot for most applications. Enough expressiveness for meaningful adaptation without excessive cost.
Rank 64+: Approaching full fine-tuning cost in parameter count. Use when lower ranks aren't expressive enough.
In practice, rank 16 works for 80% of use cases. Rank 32 works for 95%. Diminishing returns set in fast.
Why It Works
The assumption underlying LoRA is empirically validated: fine-tuning updates have low intrinsic rank. The model doesn’t need to change very much to adapt to new domains or tasks.
This makes sense. The pre-trained model already encodes enormous amounts of knowledge. Adapting to a new domain doesn’t require wholesale rewiring, just targeted adjustments.
LoRA captures these adjustments efficiently.
Practical Implementation
Using LoRA in code is trivial with the peft library:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
)

# `model` is any pre-loaded transformers model
model = get_peft_model(model, config)
Train the model normally. Only A and B matrices get updated. At inference, merge the weights or keep them separate for easy switching between adapters.
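Merging works because the update is just a matrix: W' = W + (α/r) × B × A folds the adapter into a single weight with zero inference overhead. A quick NumPy check of the equivalence (illustrative, not peft internals):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 512, 16, 32
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
x = rng.standard_normal(d)

adapter_out = W @ x + (alpha / r) * (B @ (A @ x))  # adapter kept separate
W_merged = W + (alpha / r) * (B @ A)               # weights folded together
merged_out = W_merged @ x

assert np.allclose(adapter_out, merged_out)        # identical outputs
```

Keeping the adapter separate costs a little extra compute per forward pass but lets you hot-swap adapters; merging gives the fastest inference for a single fixed task.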
The Practical Advantage
A 70B model can be fine-tuned with LoRA on a single high-end GPU, especially when the frozen base weights are also quantized (as in QLoRA). Storage is minimal: a LoRA adapter weighs in at megabytes to a few hundred megabytes, versus roughly 140 GB for the full 70B model in fp16. You can load multiple adapters and switch between them at runtime.
This unlocks a new development model: one base model plus many specialized adapters. No need to train 10 different full models; train 10 LoRA adapters instead, at a small fraction of the cost.
Laeka Research — laeka.org