How to Write an LLM Observability Runbook

1. A runbook is an operating standard, not just an incident note

LLM failures often look similar from the user side even when the causes are completely different. Model latency, prompt regressions, retrieval failures, token explosions, and tool-call exceptions can all show up as “bad answers.”

That is why the runbook needs to connect architecture, signal definitions, failure classes, and response order in one place.

2. Define the minimum telemetry clearly

Weak observability usually means that teams collect many logs but cannot correlate them to a single request. At minimum, a trace should show request ID, user context, model version, prompt version, tool-call stages, latency, token usage, and final outcome.

Request metadata: session, feature, model, prompt version.
Performance signals: end-to-end latency, time to first token, retries, token usage.
Quality signals: policy blocks, refusal rate, evaluation score, user complaints, review outcomes.
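
As a rough illustration of the fields listed above, the following sketch models one trace as a single serializable record. The class names (LLMTrace, ToolCall), the field names, and the outcome labels are assumptions made for this example; they are not taken from any standard or library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    """One tool-call stage inside a request (hypothetical shape)."""
    name: str
    latency_ms: float
    ok: bool


@dataclass
class LLMTrace:
    """One record per request, so a single lookup shows the whole picture."""
    # Request metadata
    request_id: str
    session_id: str
    feature: str
    model: str
    prompt_version: str
    # Performance signals
    latency_ms: float
    time_to_first_token_ms: float
    retries: int
    input_tokens: int
    output_tokens: int
    # Quality signals; outcome labels such as "ok" or "policy_block" are assumed
    outcome: str
    eval_score: Optional[float] = None
    tool_calls: List[ToolCall] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested ToolCall entries to plain dicts.
        return json.dumps(asdict(self), sort_keys=True)


example = LLMTrace(
    request_id="req-001", session_id="sess-9", feature="search-assistant",
    model="example-model-v1", prompt_version="v3",
    latency_ms=1840.0, time_to_first_token_ms=410.0, retries=1,
    input_tokens=1200, output_tokens=350, outcome="ok",
    tool_calls=[ToolCall(name="retrieval", latency_ms=220.0, ok=True)],
)
```

Keeping everything on one record is a design choice: a responder can filter by prompt_version or model without joining separate log streams.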

3. Incident response should be hypothesis-driven

The most useful runbooks tell responders where to look first. If latency spikes, check whether the model API slowed down, retrieval is bottlenecked, or retry loops multiplied the number of calls. If hallucinations increase, review recent changes to prompts, context, retrieval, and policies before blaming the model alone.
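
One way to turn the latency hypotheses above into a first check is to compare per-stage timings against a baseline and look at the largest regression first. This is a minimal sketch; the stage names and the function rank_latency_hypotheses are hypothetical, and it assumes per-stage timings are already aggregated from traces.

```python
from typing import Dict, List, Tuple


def rank_latency_hypotheses(
    baseline_ms: Dict[str, float], incident_ms: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (stage, increase_ms) pairs, largest increase first.

    A stage missing from the baseline is treated as having cost 0 ms before.
    """
    deltas = {
        stage: incident_ms[stage] - baseline_ms.get(stage, 0.0)
        for stage in incident_ms
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-stage timings (for example, medians over a time window).
baseline = {"model_api_ms": 800.0, "retrieval_ms": 120.0, "retry_overhead_ms": 0.0}
incident = {"model_api_ms": 850.0, "retrieval_ms": 900.0, "retry_overhead_ms": 40.0}

ranked = rank_latency_hypotheses(baseline, incident)
```

With these assumed numbers, retrieval shows the largest increase, so the responder would start there rather than with the model API.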

4. Evaluation belongs inside the same loop

Logs tell you what happened. Evaluation tells you whether the outcome was acceptable. A strong runbook connects production traces to routine sampling and scoring so teams can decide what to roll back or scale up.
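
The sketch below shows one possible shape for routine sampling and scoring: it picks a deterministic sample of traces by hashing the request ID, then averages a score per prompt version so a regression after a prompt change becomes visible. The trace keys, the function names, and the score_fn callback are assumptions for illustration; the scoring itself (human review, rubric, or model grader) is left to the caller.

```python
import hashlib
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def mean_score_by_prompt_version(
    traces: Iterable[dict], score_fn: Callable[[dict], float], rate: float
) -> Dict[str, float]:
    """Score the sampled traces and average the scores per prompt version."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for trace in traces:
        if in_sample(trace["request_id"], rate):
            scores[trace["prompt_version"]].append(score_fn(trace))
    return {version: mean(values) for version, values in scores.items()}
```

Hashing the request ID, rather than drawing random numbers, means the same request is always in or out of the sample, which keeps reruns of an evaluation comparable.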

5. Every incident should end in a reproducible change

Postmortems should record which signals arrived late, which log fields were missing, and which tests would have caught the problem earlier. Otherwise the runbook never gets stronger.
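
A postmortem finding such as "this log field was missing" can be captured as a small check that runs against sampled traces. The following is a minimal sketch under assumed names: REQUIRED_FIELDS and missing_fields are hypothetical, and the required set would come from the team's own postmortems.

```python
from typing import Iterable, List

# Hypothetical set; extend it whenever a postmortem finds a field was missing.
REQUIRED_FIELDS = ("request_id", "model", "prompt_version", "latency_ms", "outcome")


def missing_fields(trace: dict, required: Iterable[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required fields that are absent, None, or an empty string."""
    return sorted(
        name for name in required
        if trace.get(name) is None or trace.get(name) == ""
    )


# A trace that lost its prompt version, as might be found during an incident.
incomplete = {
    "request_id": "req-002",
    "model": "example-model-v1",
    "latency_ms": 950.0,
    "outcome": "ok",
}
gaps = missing_fields(incomplete)
```

Running a check like this in CI or on a sample of production traces is one way to make the change reproducible instead of leaving it as a note in the incident document.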

Practical Checklist

Make one trace sufficient to inspect model, prompt, tool calls, latency, and outcome together.
Write response steps in terms of hypotheses and comparison checks, not generic alerts.
Connect observability and evaluation so the same loop improves both diagnosis and quality.

References

OpenTelemetry, Observability primer
A baseline reference for logs, metrics, and traces.
OpenTelemetry, Semantic conventions for generative AI systems
Useful conventions for instrumenting LLM systems.
OpenAI, Trace grading
Current OpenAI guidance on evaluating traces and outputs.