1. Define incidents broadly enough to catch AI-specific failures
An AI service can be “up” while still failing users through hallucinations, broken retrieval, unsafe tool calls, or runaway cost. A playbook needs to classify those as incidents when the customer impact is real.
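As a rough illustration of that classification, the sketch below flags incident types from a window of health metrics even when availability looks fine. The `ServiceSnapshot` fields, the threshold values, and the incident labels are all assumptions made for this example, not recommended settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceSnapshot:
    """Hypothetical per-window metrics; every field name is an assumption."""
    availability: float          # fraction of requests answered, 0.0 to 1.0
    hallucination_rate: float    # fraction of sampled answers flagged as unsupported
    retrieval_hit_rate: float    # fraction of queries that retrieved a relevant document
    unsafe_tool_calls: int       # tool calls flagged by policy checks in the window
    cost_per_request_usd: float  # average spend per request in the window


# Illustrative thresholds only; real values depend on the product's baseline.
THRESHOLDS = {
    "availability_min": 0.99,
    "hallucination_rate_max": 0.05,
    "retrieval_hit_rate_min": 0.80,
    "unsafe_tool_calls_max": 0,
    "cost_per_request_usd_max": 0.10,
}


def detect_incident_types(s: ServiceSnapshot) -> List[str]:
    """Return every incident type the snapshot triggers, in a fixed order."""
    found = []
    if s.availability < THRESHOLDS["availability_min"]:
        found.append("outage")
    if s.hallucination_rate > THRESHOLDS["hallucination_rate_max"]:
        found.append("hallucination")
    if s.retrieval_hit_rate < THRESHOLDS["retrieval_hit_rate_min"]:
        found.append("broken_retrieval")
    if s.unsafe_tool_calls > THRESHOLDS["unsafe_tool_calls_max"]:
        found.append("unsafe_tool_call")
    if s.cost_per_request_usd > THRESHOLDS["cost_per_request_usd_max"]:
        found.append("runaway_cost")
    return found


# The service is "up" (99.9% availability) but still failing users.
snapshot = ServiceSnapshot(
    availability=0.999,
    hallucination_rate=0.12,
    retrieval_hit_rate=0.55,
    unsafe_tool_calls=0,
    cost_per_request_usd=0.04,
)
print(detect_incident_types(snapshot))  # ['hallucination', 'broken_retrieval']
```

The point of the design is that availability is only one of several triggers, so a quality failure opens an incident without waiting for an outage.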
2. Create response tiers and owners
The team should know which incidents require product, platform, support, or policy involvement. Clear ownership shortens the time between detection and containment.
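One way to make ownership explicit is a routing table that maps each incident type to a tier and the teams that must be involved. The sketch below is a minimal example; the tier names, the incident labels, and the team assignments are hypothetical and would differ per organization.

```python
from enum import Enum
from typing import List, Tuple


class Tier(Enum):
    SEV1 = 1  # immediate response expected
    SEV2 = 2  # same-day response
    SEV3 = 3  # tracked and handled in normal working hours


# Hypothetical routing table: incident type -> (tier, teams to involve).
ROUTING = {
    "unsafe_tool_call": (Tier.SEV1, ["platform", "policy"]),
    "hallucination": (Tier.SEV2, ["product", "support"]),
    "broken_retrieval": (Tier.SEV2, ["platform"]),
    "runaway_cost": (Tier.SEV3, ["platform", "product"]),
}

# Unknown incident types still need an owner, so they fall back to a default.
DEFAULT_ROUTE = (Tier.SEV2, ["platform"])


def route(incident_type: str) -> Tuple[Tier, List[str]]:
    """Look up the tier and owning teams for an incident type."""
    tier, owners = ROUTING.get(incident_type, DEFAULT_ROUTE)
    return tier, list(owners)  # copy so callers cannot mutate the table


tier, owners = route("unsafe_tool_call")
print(tier.name, owners)  # SEV1 ['platform', 'policy']
```

Keeping the table in version control means a change of ownership is reviewed like any other change.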
3. Prepare rollback and degradation paths
A useful playbook includes safe fallback states such as disabling one tool, switching to a simpler model path, or moving high-risk requests to human review. Falling back to a known-safe state is often faster than trying to fix the root cause live.
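The three fallback states mentioned above can be modeled as switches that responders flip during an incident. This is a minimal sketch under assumed names: `DegradationState`, `plan_request`, and the route labels are hypothetical, and a real system would read the switches from its own feature-flag or configuration store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DegradationState:
    """Hypothetical kill switches that responders can flip during an incident."""
    disabled_tools: Set[str] = field(default_factory=set)
    use_simple_model: bool = False
    human_review_for_high_risk: bool = False


def plan_request(state: DegradationState,
                 requested_tools: List[str],
                 high_risk: bool) -> Dict[str, object]:
    """Decide how to serve one request under the current degradation state."""
    # High-risk requests bypass the model entirely when the switch is on.
    if high_risk and state.human_review_for_high_risk:
        return {"route": "human_review", "tools": []}
    # Drop any tool that has been disabled; keep the rest in order.
    allowed = [t for t in requested_tools if t not in state.disabled_tools]
    route = "simple_model" if state.use_simple_model else "primary_model"
    return {"route": route, "tools": allowed}


# During an incident: one tool is misbehaving and high-risk traffic goes to humans.
state = DegradationState(disabled_tools={"send_email"},
                         human_review_for_high_risk=True)
print(plan_request(state, ["search", "send_email"], high_risk=False))
# {'route': 'primary_model', 'tools': ['search']}
print(plan_request(state, ["search"], high_risk=True))
# {'route': 'human_review', 'tools': []}
```

Each switch is independent, so responders can contain one failure without degrading the rest of the service.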
4. Capture evidence while the incident is active
Logs, prompts, traces, user examples, and model version data are easier to collect during the event than to reconstruct afterward. The playbook should specify what evidence is mandatory.
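A mandatory-evidence rule is easy to enforce with a small check that runs while the incident is open. The sketch below assumes the five evidence types named above are the mandatory set; the field names, the `snapshot_evidence` helper, and the JSON layout are illustrative choices, not a standard format.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List

# Assumed mandatory set, taken from the evidence types listed in the text.
MANDATORY_FIELDS = ("logs", "prompts", "traces", "user_examples", "model_version")


def missing_evidence(bundle: Dict[str, object]) -> List[str]:
    """Return mandatory fields that are absent or empty in the bundle."""
    return [name for name in MANDATORY_FIELDS if not bundle.get(name)]


def snapshot_evidence(incident_id: str, bundle: Dict[str, object]) -> str:
    """Serialize the bundle with a UTC capture time and a list of gaps."""
    record = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "missing": missing_evidence(bundle),
        "evidence": bundle,
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical partial bundle collected mid-incident.
bundle = {
    "logs": ["gateway.log excerpt"],
    "prompts": ["example failing prompt"],
    "model_version": "candidate-model-v2",
}
print(missing_evidence(bundle))  # ['traces', 'user_examples']
```

Recording the gaps alongside the evidence lets the incident lead see what is still missing before the incident is closed.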
5. Use post-incident reviews to update controls
Each incident should strengthen monitoring, routing rules, or release gates. Otherwise the same failure returns in a different form.
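One concrete way to turn a review into a release gate is to add a regression case per incident and block any release that fails one. The sketch below is a simplified example: `RegressionCase`, `release_gate`, and the stub model are hypothetical, and the substring check stands in for whatever evaluation method the team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RegressionCase:
    """One check derived from a past incident; all fields are illustrative."""
    incident_id: str
    prompt: str
    must_not_contain: str  # text that appeared in the bad output


def release_gate(generate: Callable[[str], str],
                 cases: List[RegressionCase]) -> List[str]:
    """Return the incident ids whose regression case the candidate fails."""
    failures = []
    for case in cases:
        output = generate(case.prompt)
        # Case-insensitive substring match; a deliberate simplification.
        if case.must_not_contain.lower() in output.lower():
            failures.append(case.incident_id)
    return failures


# Hypothetical cases added after two past incidents.
cases = [
    RegressionCase("INC-1", "What is the refund window?", "90 days"),
    RegressionCase("INC-2", "Cancel my account", "account deleted"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for the candidate release; repeats the INC-1 mistake."""
    if "refund" in prompt:
        return "Refunds are accepted within 90 days."
    return "I have opened a cancellation request for review."


failed = release_gate(stub_model, cases)
print(failed)  # ['INC-1']
```

A non-empty result would block the release, which is how a past incident becomes a standing control.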
Practical Checklist
- Classify silent quality failures as incidents when user impact is real.
- Define clear owners and safe degradation paths.
- Turn incident reviews into updated controls and release rules.
References
- Google SRE, Incident response
A strong baseline for incident command and communication patterns.
- OpenAI Evals Guide
Relevant when service incidents are driven by quality regressions, not infrastructure loss.
- OpenTelemetry, Concepts
Useful for evidence collection and trace-driven investigation.