How to Build a DPO Dataset From Scratch: A Practical Guide
Building a DPO (Direct Preference Optimization) dataset from scratch is methodical work. It takes planning, discipline, and iteration. This guide walks through every step, from definition to deployment.
Phase 1: Define Your Scope
What domain are you training for? Customer support? Code generation? Summarization? Academic writing? Be specific.
Define success criteria. What makes a response good in your domain? For support, maybe it’s: answers the question, acknowledges emotion, provides next steps. For code, maybe it’s: correct syntax, follows style guide, includes comments.
Write these down. They become your rubric.
Phase 2: Collect Real Prompts
You need 100-200 prompts to start. Use real user data. Don’t invent them.
Sample from your actual user base. If you’re training a support bot, pull real tickets. If it’s code generation, use real issue descriptions. If it’s writing assistance, use actual user requests.
Aim for diversity. Mix easy questions with hard ones. Include edge cases. Include common failure modes.
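As a sketch, deduplicating raw prompts and sampling evenly across categories might look like the following. The `category` field, the per-category count, and the input shape are assumptions for illustration, not part of the guide.

```python
import random

def sample_prompts(raw_prompts, per_category=40, seed=0):
    """Drop exact duplicates, then sample evenly across categories.

    raw_prompts: list of {"text": ..., "category": ...} dicts (assumed shape).
    """
    seen = set()
    by_category = {}
    for p in raw_prompts:
        key = p["text"].strip().lower()
        if key in seen:
            continue  # skip exact duplicates (after normalization)
        seen.add(key)
        by_category.setdefault(p["category"], []).append(p)
    rng = random.Random(seed)
    sample = []
    for cat, items in by_category.items():
        rng.shuffle(items)
        sample.extend(items[:per_category])  # cap each category
    return sample

prompts = [
    {"text": "Where is my order?", "category": "shipping"},
    {"text": "where is my order? ", "category": "shipping"},  # duplicate
    {"text": "How do I reset my password?", "category": "account"},
]
print(len(sample_prompts(prompts)))  # 2 (duplicate removed)
```

Capping per category keeps one common ticket type from dominating the set, which is the practical meaning of "aim for diversity" above.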
Phase 3: Generate Multiple Responses
For each prompt, generate 3-5 candidate responses. Use your base model or a stronger model. Vary temperature and other decoding settings (top-p, sampling strategy) to get different styles and quality levels.
You want variation. Some responses should be clearly good. Some should be clearly bad. Some should be borderline. This creates training signal across the quality spectrum.
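One way to spread candidates across the quality spectrum is to sweep temperature. Here `generate` is a stand-in stub for whatever inference API you actually call; the temperatures and the stub's behavior are assumptions.

```python
import random

def generate(prompt, temperature, seed):
    """Stand-in for a real model call (replace with your inference API)."""
    rng = random.Random(seed)
    styles = ["terse", "detailed", "empathetic", "robotic"]
    return f"[{rng.choice(styles)} answer at T={temperature}] {prompt}"

def candidates(prompt, temps=(0.2, 0.7, 1.0, 1.2)):
    """One candidate per temperature: low T tends toward safe/conservative,
    high T toward varied (and sometimes worse) responses."""
    return [generate(prompt, t, seed=i) for i, t in enumerate(temps)]

resp = candidates("My order hasn't arrived. What do I do?")
print(len(resp))  # 4
```

Low-temperature samples give you the "clearly good" end, while high-temperature samples often supply the borderline and clearly bad responses you also need.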
Phase 4: Annotate With Your Rubric
Now comes human judgment. Use your rubric. Don’t just pick “best” and “worst”. Score each response on your criteria: clarity, correctness, completeness, safety, relevance.
Record not just the scores but the reasoning. Why did Response A score higher? What did Response B miss? This diagnostic context becomes part of your training signal.
Use a tool. Google Sheets, Qualtrics, Label Studio, or even a custom Python script. Just keep it organized.
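Whatever tool you use, a consistent record shape helps. A minimal sketch of one annotation record, using the five criteria named above (the 1-5 scale and field names are assumptions):

```python
from dataclasses import dataclass

CRITERIA = ["clarity", "correctness", "completeness", "safety", "relevance"]

@dataclass
class Annotation:
    prompt_id: str
    response_id: str
    scores: dict      # criterion -> 1-5 score (assumed scale)
    rationale: str    # the "why" -- this becomes diagnostic context later

    def total(self):
        """Sum across criteria; used later to rank responses."""
        return sum(self.scores[c] for c in CRITERIA)

a = Annotation(
    "p1", "r1",
    {"clarity": 4, "correctness": 5, "completeness": 3,
     "safety": 5, "relevance": 4},
    "Accurate and safe, but misses the refund policy.",
)
print(a.total())  # 21
```

The `rationale` field is deliberately required: it forces annotators to record the reasoning, not just the scores.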
Phase 5: Extract Preference Pairs
From your scores, build preference pairs. High-scoring response vs low-scoring response. Include the diagnostic context.
Example:
Prompt: “My order hasn’t arrived. What do I do?”
Better Response: “I’m sorry your order hasn’t arrived. Let me look that up for you. Can you give me your order number? I’ll check the shipping status and we’ll figure out next steps together.” [Reason: Acknowledges frustration, asks for information, offers concrete help.]
Weaker Response: “Orders typically take 5-7 business days. If it’s been longer, contact shipping.” [Reason: Doesn’t acknowledge their frustration, doesn’t ask for details, feels robotic.]
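The score-to-pair step can be sketched as follows. The minimum score margin and the tuple shape are assumptions; the idea is simply "high-scoring vs low-scoring, with diagnostic context attached."

```python
def extract_pairs(prompt, scored_responses, min_margin=2):
    """Build preference pairs from scored responses.

    scored_responses: list of (text, total_score, rationale) tuples.
    Only pairs whose score gap is at least min_margin are kept,
    to avoid training on near-ties.
    """
    ranked = sorted(scored_responses, key=lambda r: r[1], reverse=True)
    pairs = []
    for better in ranked:
        for weaker in ranked:
            if better[1] - weaker[1] >= min_margin:
                pairs.append({
                    "prompt": prompt,
                    "better_response": better[0],
                    "weaker_response": weaker[0],
                    "diagnosis": f"Chosen: {better[2]} / Rejected: {weaker[2]}",
                })
    return pairs

responses = [
    ("Empathetic, asks for the order number.", 22, "acknowledges frustration"),
    ("Quotes shipping policy only.", 14, "robotic, no follow-up"),
]
pairs = extract_pairs("My order hasn't arrived.", responses)
print(len(pairs))  # 1
```

Requiring a score margin is a judgment call: borderline pairs carry weak or noisy signal, so filtering them out usually helps more than it hurts.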
Phase 6: Quality Checks
Check 1: Inter-rater agreement. Have a second person annotate 20% of your data. Do they agree with the first annotator? Target 70%+ agreement.
Check 2: Duplicate detection. Are any prompts repeated? Remove exact duplicates.
Check 3: Label distribution. Are your preferences balanced? Aim for roughly equal distribution across quality levels.
Check 4: Annotator consistency. Did a single person annotate all of the data? That's a risk: their individual biases become the dataset's biases. Distribute the annotation load across multiple people.
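Checks 1 and 2 are easy to automate. A sketch of percent agreement (the simplest inter-rater metric; Cohen's kappa is a stricter alternative) and normalized duplicate detection:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items where two annotators picked the same preference."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def find_duplicate_prompts(prompts):
    """Return prompt texts that appear more than once after normalization."""
    seen, dupes = set(), set()
    for p in prompts:
        key = " ".join(p.lower().split())  # collapse case and whitespace
        (dupes if key in seen else seen).add(key)
    return dupes

agree = percent_agreement(["A", "B", "A", "A", "B"],
                          ["A", "B", "B", "A", "B"])
print(agree)  # 0.8 -- above the 70% target
```

If agreement lands below 70%, the usual culprit is an ambiguous rubric, not careless annotators: tighten the criteria first.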
Phase 7: Format and Prepare
Format your dataset consistently: JSON, CSV, HuggingFace format, whatever your training pipeline expects. Include columns for prompt, weaker_response, better_response, diagnosis, scores.
Split into train/validation: 80/20 or 70/30. Validation set should be held out from training.
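A minimal sketch of the formatting and split step, writing JSONL (one common choice; swap in CSV or a HuggingFace dataset if that's what your pipeline expects). The file names and 80/20 fraction are illustrative.

```python
import json
import random

def split_and_write(pairs, train_path, val_path, val_frac=0.2, seed=0):
    """Shuffle pairs, hold out a validation fraction, write JSONL files."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)  # shuffle before splitting so the split is random
    n_val = int(len(pairs) * val_frac)
    val, train = pairs[:n_val], pairs[n_val:]
    for path, rows in [(train_path, train), (val_path, val)]:
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")  # one JSON object per line
    return len(train), len(val)

pairs = [{"prompt": f"p{i}", "better_response": "b", "weaker_response": "w",
          "diagnosis": "d", "scores": {}} for i in range(10)]
print(split_and_write(pairs, "train.jsonl", "val.jsonl"))  # (8, 2)
```

Fixing the shuffle seed makes the split reproducible, so reruns of the pipeline hold out the same validation pairs.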
Phase 8: Iterate
Train on 100 pairs first. Measure the impact on your model. Does it improve? If yes, scale to 500. If no, revisit your rubric or your annotation quality.
The first version won’t be perfect. Iterate. Add more prompts. Tighten your rubric. Remove noisy pairs.
Expected Timeline
Collecting prompts: 1-2 weeks. Generating responses: 2-3 days. Annotation: 4-6 weeks (depends on team size). Quality checks: 1 week. Iteration: ongoing.
Total: 2-3 months for a solid 500-pair dataset. Don’t rush it.
Laeka Research — laeka.org