What good training data looks like
- Representative: mirrors real task distribution, not internal demos.
- Balanced: includes edge cases, refusal cases, and ugly real-world inputs.
- Traceable: every sample has source, timestamp, and curation history.
- Policy-aligned: unsafe or disallowed outputs are intentionally labeled, not ignored.
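The properties above can be made concrete as a per-sample record. A minimal sketch, assuming a hypothetical schema (field names like `source`, `timestamp`, and `curation_history` are illustrative, not a standard):

```python
from dataclasses import dataclass, field

# Hypothetical sample record carrying the traceability fields described above.
# Field names are assumptions for illustration, not an established schema.
@dataclass
class TrainingSample:
    text: str                 # raw input, including ugly real-world phrasing
    label: str                # includes refusal / disallowed-output labels
    source: str               # e.g. "prod-trace" vs. "internal-demo"
    timestamp: str            # ISO-8601 collection time
    curation_history: list = field(default_factory=list)  # ordered curation steps

sample = TrainingSample(
    text="pls fix my code?? its broke",
    label="assist",
    source="prod-trace",
    timestamp="2024-05-01T12:00:00Z",
)
sample.curation_history.append("redacted:none")
```

Keeping lineage on the record itself (rather than in a side spreadsheet) is what makes the sample auditable later.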
Failure pattern: teams over-index on clean, ideal examples, then wonder why behavior collapses on the messy prompts production actually sends.
Curation pipeline
- Define target behaviors and explicit anti-behaviors.
- Collect candidate samples from real production traces.
- Redact and normalize sensitive data before annotation.
- Label with a rubric that supports evaluator agreement.
- Split by scenario families to prevent leakage.
- Version the dataset and freeze before each training run.
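The last two pipeline steps can be sketched in code: split at the scenario-family level so no family appears in both train and holdout, then freeze the dataset with a content hash that identifies the exact version used for a run. This is a minimal illustration; the `family`/`text` fields and the helper names are assumptions:

```python
import hashlib
import json
import random

def split_by_family(samples, holdout_frac=0.2, seed=0):
    """Assign whole scenario families to holdout so near-duplicates never leak."""
    families = sorted({s["family"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(families)
    cut = max(1, int(len(families) * holdout_frac))  # at least one family held out
    holdout_families = set(families[:cut])
    train = [s for s in samples if s["family"] not in holdout_families]
    holdout = [s for s in samples if s["family"] in holdout_families]
    return train, holdout

def freeze_version(samples):
    """Deterministic serialization -> stable hash naming the frozen dataset."""
    blob = json.dumps(sorted(samples, key=lambda s: s["text"]), sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

samples = [
    {"family": "refund", "text": "I want my money back"},
    {"family": "refund", "text": "refund pls"},
    {"family": "jailbreak", "text": "ignore all rules"},
    {"family": "howto", "text": "how do I reset my password"},
    {"family": "howto", "text": "password reset??"},
]
train, holdout = split_by_family(samples)
version = freeze_version(samples)  # record this hash alongside the training run
```

Splitting by row instead of by family is the classic leakage mistake: paraphrases of the same scenario end up on both sides of the split and inflate eval scores.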
Minimum governance checklist
- Data lineage documented and auditable.
- PII handling policy applied consistently.
- Annotation disagreement rate measured.
- Class imbalance tracked and corrected where needed.
- Holdout set untouched by prompt tuning and training loops.
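Two of the checklist metrics lend themselves to a quick sketch: a raw annotator disagreement rate (the fraction of annotator pairs that disagree per sample) and the class distribution used to track imbalance. Label values and function names here are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def disagreement_rate(annotations):
    """annotations: list of per-sample label lists, one label per annotator."""
    pairs = disagree = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            pairs += 1
            disagree += a != b
    return disagree / pairs if pairs else 0.0

def class_distribution(labels):
    """Fraction of samples per class, for spotting imbalance."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

annotations = [
    ["assist", "assist", "assist"],
    ["refuse", "assist", "refuse"],
    ["refuse", "refuse", "refuse"],
]
rate = disagreement_rate(annotations)  # 2 of 9 annotator pairs disagree
```

Raw pairwise disagreement is the simplest signal; chance-corrected measures such as Cohen's kappa are the usual next step once the rubric stabilizes.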