Quality control

Evaluation and Release Gates

If your eval framework cannot block a bad release, it is documentation cosplay, not quality control.

Four gates before production

Gate 1 — Offline task evals

Score every candidate against frozen holdout sets and report results per scenario family. No cherry-picked prompts, no moving targets.
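A minimal sketch of this gate, assuming exact-match grading and a holdout stored as a list of dicts. The function names, the `family`, `prompt`, and `expected` fields, and the thresholds are all hypothetical; a real grader would likely be richer than string equality. The fingerprint check is one way to enforce "frozen": scoring refuses to run if the set has been edited.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(holdout):
    """Stable hash of the holdout set, used to detect silent edits."""
    canonical = json.dumps(holdout, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def score_by_family(holdout, predict, expected_fingerprint):
    """Return {family: pass_rate}; refuse to score if the holdout changed."""
    if fingerprint(holdout) != expected_fingerprint:
        raise ValueError("holdout set changed since it was frozen")
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in holdout:
        total[case["family"]] += 1
        # Assumption: exact string match stands in for the real grader.
        if predict(case["prompt"]) == case["expected"]:
            passed[case["family"]] += 1
    return {family: passed[family] / total[family] for family in total}


def gate_offline(scores, thresholds):
    """Pass only if every family meets its own threshold (no cross-family averaging)."""
    return all(
        scores.get(family, 0.0) >= minimum
        for family, minimum in thresholds.items()
    )
```

Gating per family rather than on a single average keeps a strong scenario group from hiding a weak one.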

Gate 2 — Safety and abuse evals

Run adversarial prompts for policy bypass, data leakage, and harmful output patterns.
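One way to wire this up, as a sketch only. `generate` stands in for the model call and `violates` for whatever policy check you trust (a classifier, a rule set, a reviewer queue); both are hypothetical callables, and the zero-tolerance default is an assumption, not a rule from this guide.

```python
CATEGORIES = ("policy_bypass", "data_leakage", "harmful_output")


def run_safety_suite(attacks, generate, violates):
    """Count violations per category. `attacks` is a list of (category, prompt) pairs."""
    violations = {category: 0 for category in CATEGORIES}
    for category, prompt in attacks:
        if category not in violations:
            raise ValueError(f"unknown category: {category}")
        # `violates` decides whether the model output breaks policy for this category.
        if violates(category, generate(prompt)):
            violations[category] += 1
    return violations


def gate_safety(violations, max_allowed=0):
    """Block the release if any category exceeds the allowed violation count."""
    return all(count <= max_allowed for count in violations.values())
```

Reporting counts per category makes it clear which attack family regressed when the gate fails.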

Gate 3 — Human acceptance test

Domain operators review sampled outputs for tone, correctness, and practical usability.
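The review itself is human work, but the sampling can be made reproducible. A small sketch, assuming each output carries a `family` label and each review records three booleans (`tone`, `correct`, `usable`); these field names and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def sample_for_review(outputs, per_family, seed=0):
    """Pick up to `per_family` outputs from each scenario family, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    by_family = defaultdict(list)
    for item in outputs:
        by_family[item["family"]].append(item)
    sample = []
    for family in sorted(by_family):
        items = by_family[family]
        sample.extend(rng.sample(items, min(per_family, len(items))))
    return sample


def acceptance_rate(reviews):
    """Share of reviewed outputs accepted on all three criteria."""
    if not reviews:
        return 0.0
    accepted = sum(1 for r in reviews if r["tone"] and r["correct"] and r["usable"])
    return accepted / len(reviews)
```

Sampling per family keeps rare scenarios in front of reviewers instead of letting the most common traffic dominate the queue.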

Gate 4 — Canary in live traffic

Deploy to a small, bounded segment of live traffic, with rollback triggers defined before launch.
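"Defined before launch" can be as simple as a list of thresholds checked against live metrics. The sketch below assumes upper-bound triggers only; the `RollbackTrigger` type, the metric names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    metric: str       # name of a live metric, e.g. "critical_failure_rate"
    max_value: float  # roll back when the observed value exceeds this


def tripped_triggers(triggers, live_metrics):
    """Return the metric names whose triggers fired. A missing metric counts as fired."""
    fired = []
    for trigger in triggers:
        observed = live_metrics.get(trigger.metric)
        if observed is None or observed > trigger.max_value:
            fired.append(trigger.metric)
    return fired


# Written down before the canary starts, not after it misbehaves.
TRIGGERS = [
    RollbackTrigger("critical_failure_rate", 0.01),
    RollbackTrigger("p95_latency_ms", 1200.0),
]

fired = tripped_triggers(
    TRIGGERS, {"critical_failure_rate": 0.004, "p95_latency_ms": 1500.0}
)
# fired == ["p95_latency_ms"], so the canary should roll back
```

Treating a missing metric as a fired trigger is a deliberate choice here: a canary you cannot observe is not a canary you can trust.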

Metrics that matter

Task success rate by scenario group.
Critical failure rate (weighted by business impact).
Refusal quality (refusing what should be refused, answering what should be answered) and policy adherence.
Latency and cost deltas versus baseline.
User correction burden per response.
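A sketch of how three of these metrics might be computed from per-response records. The definitions are assumptions: "weighted by business impact" is read here as an impact-weighted share of failures, and correction burden as the mean number of user edits per response. The field names `impact_weight`, `critical_failure`, and `user_edits` are hypothetical.

```python
def weighted_critical_failure_rate(records):
    """Impact weight of critical failures divided by total impact weight."""
    total = sum(r["impact_weight"] for r in records)
    failed = sum(r["impact_weight"] for r in records if r["critical_failure"])
    return failed / total if total else 0.0


def relative_delta(candidate, baseline):
    """Fractional change versus baseline; 0.10 means 10% higher than baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def correction_burden(records):
    """Mean number of user edits per response."""
    if not records:
        return 0.0
    return sum(r["user_edits"] for r in records) / len(records)
```

With weighting, one failure on a high-impact request can outweigh several on low-impact ones, which is the point of the metric.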

Ship rule: every release needs a pre-written rollback condition. If rollback is improvised after failure, the release process is broken.
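The ship rule is easy to enforce mechanically. A minimal sketch, assuming a release is described by a dict with a `gates` map and a `rollback_condition` string; both keys and the gate names are hypothetical.

```python
REQUIRED_GATES = ("offline", "safety", "human", "canary")


def can_ship(release):
    """Return (ok, reasons). A failed gate or a missing rollback condition blocks the release."""
    reasons = []
    gates = release.get("gates", {})
    for gate in REQUIRED_GATES:
        if not gates.get(gate, False):
            reasons.append(f"gate not passed: {gate}")
    # An empty or whitespace-only rollback condition counts as missing.
    if not str(release.get("rollback_condition") or "").strip():
        reasons.append("no pre-written rollback condition")
    return (not reasons, reasons)
```

Run as a required step in the release pipeline, a check like this gives the framework the power to block a release rather than just describe one.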