Designing for the 3am page
Observability principles for AI workloads that have to be debugged at 3am.
AI is not stateless
When an AI agent goes wrong at 3am, the engineer on call needs the full context: the prompt, the retrieved documents, the model outputs, the tool calls, the policy decisions. Without it, debugging is divination.
Trace everything
Every agent action is a span in a distributed trace. Tool calls, retrievals, evaluations, and policy enforcements are each recorded as a span with timing, input, output, and the prompts used. The trace is the engineer’s primary diagnostic surface.
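To make the span idea concrete, here is a minimal sketch using only the Python standard library. The names `Span`, `Trace`, and the span kinds are hypothetical, chosen for illustration; a production system would more likely use an established tracing library than a hand-rolled recorder like this one.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    """One agent action: what ran, with what, for how long (illustrative)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    kind: str                      # e.g. "tool_call", "retrieval", "policy"
    name: str
    inputs: Any = None
    output: Any = None
    prompt: Optional[str] = None   # rendered prompt, if this span used one
    start: float = 0.0
    end: float = 0.0
    error: Optional[str] = None

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Trace:
    """Collects spans for one agent run and tracks parent/child nesting."""

    def __init__(self, trace_id: Optional[str] = None):
        self.trace_id = trace_id or uuid.uuid4().hex
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, kind: str, name: str, inputs: Any = None,
             prompt: Optional[str] = None):
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, kind, name,
                 inputs=inputs, prompt=prompt, start=time.monotonic())
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)    # keep the failure on the span, then re-raise
            raise
        finally:
            s.end = time.monotonic()
            self._stack.pop()


# Usage: one agent step containing a retrieval and a policy check.
trace = Trace()
with trace.span("agent_step", "answer_question",
                inputs="What is our refund policy?") as step:
    with trace.span("retrieval", "search_docs", inputs="refund policy") as r:
        r.output = ["doc-17", "doc-42"]
    with trace.span("policy", "pii_check", inputs=r.output) as p:
        p.output = "allow"
    step.output = "answered"
```

Recording the error on the span before re-raising means a failed action still leaves a complete record in the trace, which is the case the on-call engineer cares about most.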
Save the prompt
The exact prompt sent to the model at the moment of failure must be retrievable. Not the template; the rendered prompt with all variables resolved. We use a prompt store that retains the last 90 days indexed by trace ID.
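As an illustration of the difference between the template and the rendered prompt, the sketch below stores the fully resolved text under its trace ID and hides records older than 90 days. The `PromptStore` class, its method names, and the in-memory dictionary are assumptions made for the example, not a description of the store referred to above; a real store would sit on durable storage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from string import Template
from typing import Optional

RETENTION = timedelta(days=90)  # retention window from the text above


@dataclass(frozen=True)
class PromptRecord:
    trace_id: str
    rendered_prompt: str
    stored_at: datetime


class PromptStore:
    """In-memory stand-in for a prompt store indexed by trace ID."""

    def __init__(self, retention: timedelta = RETENTION):
        self._retention = retention
        self._by_trace: dict[str, list[PromptRecord]] = {}

    def save(self, trace_id: str, rendered_prompt: str,
             now: Optional[datetime] = None) -> PromptRecord:
        record = PromptRecord(trace_id, rendered_prompt,
                              now or datetime.now(timezone.utc))
        self._by_trace.setdefault(trace_id, []).append(record)
        return record

    def get(self, trace_id: str,
            now: Optional[datetime] = None) -> list[PromptRecord]:
        # Only return records still inside the retention window.
        cutoff = (now or datetime.now(timezone.utc)) - self._retention
        return [r for r in self._by_trace.get(trace_id, [])
                if r.stored_at >= cutoff]

    def purge(self, now: Optional[datetime] = None) -> int:
        # Physically drop expired records; returns how many were removed.
        cutoff = (now or datetime.now(timezone.utc)) - self._retention
        removed = 0
        for trace_id in list(self._by_trace):
            records = self._by_trace[trace_id]
            kept = [r for r in records if r.stored_at >= cutoff]
            removed += len(records) - len(kept)
            if kept:
                self._by_trace[trace_id] = kept
            else:
                del self._by_trace[trace_id]
        return removed


# Usage: store the rendered prompt, not the template.
template = Template(
    "Answer using only these documents:\n$docs\n\nQuestion: $question")
rendered = template.substitute(
    docs="- doc-17: Refunds are accepted within 30 days.",
    question="Can I get a refund?")

store = PromptStore()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
store.save("trace-abc", rendered, now=t0)
```

Passing `now` explicitly keeps the retention logic testable without waiting 90 days or patching the clock.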
Replay capability
Once you have found the failing trace, you must be able to re-run it locally, change one variable, and observe the new behaviour. Without replay, every debugging session is a forensic exercise.
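A minimal sketch of what "change one variable" can look like in code, assuming the inputs of a failed run were captured in a single immutable record. `RecordedRun`, `replay`, and `stub_agent` are hypothetical names; `stub_agent` is a deterministic stand-in for the real model call so the example can run offline.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class RecordedRun:
    """Everything needed to re-run one agent call (illustrative fields)."""
    trace_id: str
    rendered_prompt: str
    retrieved_docs: Tuple[str, ...]
    temperature: float
    output: str


Agent = Callable[[str, Tuple[str, ...], float], str]


def replay(run: RecordedRun, agent: Agent, **overrides) -> RecordedRun:
    """Re-run a recorded call with exactly one input changed."""
    if len(overrides) != 1:
        raise ValueError("change exactly one variable per replay")
    # replace() copies the record; an unknown field name raises TypeError.
    candidate = replace(run, **overrides)
    new_output = agent(candidate.rendered_prompt,
                       candidate.retrieved_docs,
                       candidate.temperature)
    return replace(candidate, output=new_output)


def stub_agent(prompt: str, docs: Tuple[str, ...], temperature: float) -> str:
    # Hypothetical stand-in for the model: refuses when it has no sources.
    if not docs:
        return "no sources: refused"
    return f"answered from {len(docs)} docs"


# Usage: the recorded failure had an empty retrieval result.
recorded = RecordedRun(
    trace_id="trace-abc",
    rendered_prompt="Question: Can I get a refund?",
    retrieved_docs=(),
    temperature=0.0,
    output="no sources: refused",
)

# Step 1: confirm the failure reproduces with unchanged inputs.
reproduced = stub_agent(recorded.rendered_prompt,
                        recorded.retrieved_docs,
                        recorded.temperature)

# Step 2: change one variable (the retrieved documents) and compare.
changed = replay(recorded, stub_agent, retrieved_docs=("doc-17",))
```

Enforcing a single override per replay keeps each experiment attributable: if the output changes, there is only one candidate cause.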
The discipline
These three capabilities (tracing, prompt retention, and replay) must be built before the platform reaches production, not after the first incident. The cost of retrofitting them is several multiples of the cost of building them in.