What good training data looks like

Failure pattern: teams over-index on clean, idealized examples, then wonder why model behavior collapses on the messy prompts production actually sends.

Curation pipeline

  1. Define target behaviors and explicit anti-behaviors.
  2. Collect candidate samples from real production traces.
  3. Redact and normalize sensitive data before annotation.
  4. Label with a rubric precise enough to support inter-annotator agreement.
  5. Split by scenario families to prevent leakage.
  6. Version the dataset and freeze before each training run.
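Step 5 is the one most often done wrong: splitting at the sample level lets near-duplicate prompts from the same scenario straddle the train/holdout boundary. A minimal sketch of family-wise splitting, assuming each sample carries a hypothetical `family` key identifying its scenario family:

```python
import random
from collections import defaultdict

def split_by_family(samples, holdout_frac=0.2, seed=7):
    """Split samples so whole scenario families land on one side only.

    samples: list of dicts, each with a 'family' key (assumed schema).
    Returns (train, holdout); no family appears in both.
    """
    families = defaultdict(list)
    for s in samples:
        families[s["family"]].append(s)

    # Shuffle family IDs deterministically so the split is reproducible
    # and can be frozen alongside the dataset version (step 6).
    ids = sorted(families)
    random.Random(seed).shuffle(ids)

    n_holdout = max(1, int(len(ids) * holdout_frac))
    holdout_ids = set(ids[:n_holdout])

    train = [s for f, group in families.items()
             if f not in holdout_ids for s in group]
    holdout = [s for f, group in families.items()
               if f in holdout_ids for s in group]
    return train, holdout
```

Splitting on sorted-then-shuffled family IDs with a fixed seed means the same dataset version always produces the same split, which matters when the holdout must stay untouched across training runs.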

Minimum governance checklist

  • Data lineage documented and auditable.
  • PII handling policy applied consistently.
  • Annotation disagreement rate measured.
  • Class imbalance tracked and corrected where needed.
  • Holdout set untouched by prompt tuning and training loops.
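The disagreement-rate item is easy to operationalize. A sketch of one common choice, Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance (the label values here are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a, labels_b: equal-length sequences of labels for the
    same items. Returns a value in [-1, 1]; 1.0 is perfect agreement.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)

    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance of a match given each annotator's
    # marginal label distribution.
    pa, pb = Counter(labels_a), Counter(labels_b)
    expected = sum((pa[k] / n) * (pb[k] / n) for k in pa.keys() | pb.keys())

    return (observed - expected) / (1 - expected)
```

A rubric that "supports agreement" should be validated against a number like this: if kappa stays low after a rubric revision, the target behaviors are probably still ambiguous, not the annotators sloppy.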