Four gates before production

Gate 1 — Offline task evals

Score against frozen holdout sets by scenario family. No cherry-picked prompts, no moving targets.

Gate 2 — Safety and abuse evals

Run adversarial prompts for policy bypass, data leakage, and harmful output patterns.

Gate 3 — Human acceptance test

Domain operators review sampled outputs for tone, correctness, and practical usability.

Gate 4 — Canary in live traffic

Deploy to a constrained segment with rollback triggers defined before launch.

Metrics that matter

Ship rule: every release needs a pre-written rollback condition. If rollback is improvised after failure, the release process is broken.