Four gates before production
Gate 1 — Offline task evals
Score against frozen holdout sets by scenario family. No cherry-picked prompts, no moving targets.
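One way to keep the holdout set honest is to fingerprint it when it is frozen and refuse to score if the fingerprint no longer matches. The sketch below is illustrative only: the example records, the `family`/`prompt`/`expected` field names, the exact-match scoring rule, and `toy_model` are all assumptions, not part of any particular eval framework.

```python
import hashlib
import json
from collections import defaultdict


def holdout_fingerprint(examples):
    """Stable SHA-256 over the holdout set, so any edit to it is detectable."""
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def score_by_family(examples, predict, expected_fingerprint):
    """Return {family: success_rate}; refuse to score if the holdout changed."""
    if holdout_fingerprint(examples) != expected_fingerprint:
        raise ValueError("holdout set changed since it was frozen")
    passed = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["family"]] += 1
        # Assumption: exact string match counts as success.
        if predict(ex["prompt"]) == ex["expected"]:
            passed[ex["family"]] += 1
    return {family: passed[family] / total[family] for family in total}


# Hypothetical holdout data and a stand-in model, for illustration only.
HOLDOUT = [
    {"family": "billing", "prompt": "2+2", "expected": "4"},
    {"family": "billing", "prompt": "3+3", "expected": "6"},
    {"family": "returns", "prompt": "1+1", "expected": "2"},
]
FROZEN = holdout_fingerprint(HOLDOUT)  # recorded once, when the set is frozen


def toy_model(prompt):
    return {"2+2": "4", "3+3": "7", "1+1": "2"}.get(prompt, "")


scores = score_by_family(HOLDOUT, toy_model, FROZEN)
```

Reporting one number per family, rather than a single aggregate, keeps a regression in a small family from being hidden by gains in a large one.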
Gate 2 — Safety and abuse evals
Run adversarial prompts for policy bypass, data leakage, and harmful output patterns.
Gate 3 — Human acceptance test
Domain operators review sampled outputs for tone, correctness, and practical usability.
Gate 4 — Canary in live traffic
Deploy to a constrained segment with rollback triggers defined before launch.
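Rollback triggers defined before launch can be as simple as a small, immutable table of thresholds that the canary monitor checks. The following is a minimal sketch under stated assumptions: the metric names, the threshold values, and the rule that a missing metric counts as a fired trigger are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    metric: str       # name of the canary metric to watch
    threshold: float  # limit agreed before launch
    direction: str    # "above" or "below": which side of the limit is bad

    def fired(self, value: float) -> bool:
        if self.direction == "above":
            return value > self.threshold
        if self.direction == "below":
            return value < self.threshold
        raise ValueError(f"unknown direction: {self.direction!r}")


# Hypothetical thresholds, written down before the canary starts.
TRIGGERS = (
    RollbackTrigger("critical_failure_rate", 0.01, "above"),
    RollbackTrigger("p95_latency_ms", 2500.0, "above"),
    RollbackTrigger("task_success_rate", 0.90, "below"),
)


def fired_triggers(canary_metrics, triggers=TRIGGERS):
    """Return names of triggers that fired; a missing metric counts as fired."""
    fired = []
    for trigger in triggers:
        value = canary_metrics.get(trigger.metric)
        if value is None or trigger.fired(value):
            fired.append(trigger.metric)
    return fired


def should_roll_back(canary_metrics, triggers=TRIGGERS):
    """True if any pre-agreed trigger fired for this canary snapshot."""
    return bool(fired_triggers(canary_metrics, triggers))
```

Treating a missing metric as a fired trigger is a deliberately conservative choice in this sketch: a canary that cannot be measured is not known to be healthy.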
Metrics that matter
- Task success rate by scenario group.
- Critical failure rate (weighted by business impact).
- Refusal quality (refuses what it should, and only that) and policy adherence.
- Latency and cost deltas versus baseline.
- User correction burden: edits or follow-up turns needed per response.
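Two of the metrics above reduce to short formulas. The sketch below shows one possible reading: critical failure rate as an impact-weighted share of responses, and latency or cost delta as a fractional change versus baseline. The weighting scheme, the scenario group names, and the sample numbers are assumptions for illustration, not a prescribed definition.

```python
def weighted_critical_failure_rate(outcomes, impact_weights):
    """Impact-weighted share of responses that failed critically.

    outcomes: list of (scenario_group, is_critical_failure) pairs.
    impact_weights: {scenario_group: weight}; higher means costlier failures.
    """
    weighted_failures = 0.0
    weighted_total = 0.0
    for group, is_critical_failure in outcomes:
        weight = impact_weights[group]
        weighted_total += weight
        if is_critical_failure:
            weighted_failures += weight
    return weighted_failures / weighted_total if weighted_total else 0.0


def relative_delta(candidate, baseline):
    """Fractional change versus baseline, e.g. 0.25 means 25% higher."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


# Hypothetical outcomes and weights, for illustration only.
OUTCOMES = [
    ("payments", True),
    ("payments", False),
    ("faq", False),
    ("faq", False),
]
WEIGHTS = {"payments": 3.0, "faq": 1.0}

rate = weighted_critical_failure_rate(OUTCOMES, WEIGHTS)
latency_delta = relative_delta(candidate=1200.0, baseline=1000.0)
```

In this toy data one response in four fails, but because the failure lands in the heavily weighted group, the weighted rate comes out higher than the unweighted one.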
Ship rule: every release needs a pre-written rollback condition. If rollback is improvised after failure, the release process is broken.