Designing for the 3am page

Nisha Yakub

Dec 2025 · 6 min read · AI Ops

AI is not stateless

When an AI agent goes wrong at 3am, the engineer on call needs the full context: the prompt, the retrieved documents, the model outputs, the tool calls, the policy decisions. Without it, debugging is divination.

Trace everything

Every agent action is a span in a distributed trace. Tool calls, retrievals, evaluations, and policy enforcements each get their own span, carrying timing, input, output, and the prompts used. The trace is the on-call engineer’s primary diagnostic surface.
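
As a rough illustration of the span-per-action idea, here is a minimal sketch in plain Python. It is not a real tracing library: the `Span` and `Trace` classes, the field names, and the span kinds are all assumptions made for this example. In practice you would more likely reach for an existing tracing SDK.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional


@dataclass
class Span:
    """One recorded agent action (hypothetical schema for illustration)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    kind: str                     # e.g. "tool_call", "retrieval", "policy"
    input: Any
    output: Any = None
    prompt: Optional[str] = None  # rendered prompt, if a model was called
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_s(self) -> float:
        return self.end - self.start


class Trace:
    """Collects spans for one agent run and tracks parent/child nesting."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, kind: str, input: Any, prompt: Optional[str] = None) -> Iterator[Span]:
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(self.trace_id, uuid.uuid4().hex, parent_id, kind, input,
                 prompt=prompt, start=time.monotonic())
        self._stack.append(s)
        try:
            yield s
        finally:
            # Record the span even if the wrapped action raised.
            s.end = time.monotonic()
            self._stack.pop()
            self.spans.append(s)


# Usage: a retrieval nested inside an agent step.
trace = Trace()
with trace.span("agent_step", input="refund order 1234") as step:
    with trace.span("retrieval", input="refund policy") as retrieval:
        retrieval.output = ["doc-17"]
    step.output = "refund approved"
```

Spans are appended when they close, so the inner retrieval span lands in `trace.spans` before its parent. The `finally` block matters here: a span that raised is exactly the one the on-call engineer will want to see.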

Save the prompt

The exact prompt sent to the model at the moment of failure must be retrievable. Not the template; the rendered prompt with all variables resolved. We use a prompt store that retains the last 90 days of rendered prompts, indexed by trace ID.
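
A prompt store of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not a description of our production store: the `PromptStore` class, the SQLite backing, the table layout, and the use of `str.format` for rendering are all hypothetical. The point it shows is that the rendered text is what gets persisted, keyed by trace ID, with a 90-day purge.

```python
import sqlite3
import time
from typing import Dict, List, Optional

RETENTION_SECONDS = 90 * 24 * 60 * 60  # 90 days


class PromptStore:
    """Persists rendered prompts by trace ID (hypothetical sketch)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS prompts ("
            "trace_id TEXT NOT NULL, rendered TEXT NOT NULL, stored_at REAL NOT NULL)"
        )
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS idx_prompts_trace ON prompts (trace_id)"
        )

    def save(self, trace_id: str, template: str, variables: Dict[str, str],
             now: Optional[float] = None) -> str:
        # Render first, then store the result: the template alone is not enough.
        rendered = template.format(**variables)
        stored_at = time.time() if now is None else now
        self.db.execute("INSERT INTO prompts VALUES (?, ?, ?)",
                        (trace_id, rendered, stored_at))
        self.db.commit()
        return rendered

    def get(self, trace_id: str) -> List[str]:
        # A trace may contain several model calls, so return all prompts in order.
        rows = self.db.execute(
            "SELECT rendered FROM prompts WHERE trace_id = ? ORDER BY rowid",
            (trace_id,),
        ).fetchall()
        return [row[0] for row in rows]

    def purge(self, now: Optional[float] = None) -> int:
        # Delete anything older than the retention window; returns rows removed.
        cutoff = (time.time() if now is None else now) - RETENTION_SECONDS
        cur = self.db.execute("DELETE FROM prompts WHERE stored_at < ?", (cutoff,))
        self.db.commit()
        return cur.rowcount


# Usage: store a rendered prompt, then look it up by trace ID.
store = PromptStore()
store.save("trace-abc", "Summarise {doc} for {user}",
           {"doc": "policy.pdf", "user": "ana"}, now=0.0)
prompts = store.get("trace-abc")
```

The `now` parameter exists only so the retention logic can be exercised without waiting 90 days.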

Replay capability

When you find the failing trace, you must be able to re-run it locally, change one variable, and observe the new behaviour. Without replay, every debug is a forensic exercise.
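
To make "change one variable and observe" concrete, here is a small sketch of a replay helper. Everything in it is assumed for illustration: the `RecordedCall` shape, the `replay` function, and the `fake_model` stub that stands in for a real model client so the example runs locally.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class RecordedCall:
    """What was captured from the failing trace (hypothetical shape)."""
    template: str
    variables: Dict[str, str]
    output: str


def replay(recorded: RecordedCall,
           model: Callable[[str], str],
           overrides: Optional[Dict[str, str]] = None) -> Tuple[str, str]:
    """Re-render the prompt, optionally overriding variables, and call the model."""
    variables = {**recorded.variables, **(overrides or {})}
    prompt = recorded.template.format(**variables)
    return prompt, model(prompt)


def fake_model(prompt: str) -> str:
    # Deterministic stand-in for a real model call, for local experiments only.
    return "REFUSE" if "password" in prompt else "OK"


recorded = RecordedCall(
    template="Answer using {doc}: {question}",
    variables={"doc": "faq.md", "question": "reset my password"},
    output="REFUSE",
)

# First reproduce the recorded behaviour, then change exactly one variable.
baseline_prompt, baseline = replay(recorded, fake_model)
changed_prompt, changed = replay(recorded, fake_model,
                                 {"question": "reset my email"})
```

Reproducing the baseline before changing anything is worth the extra call: if the unmodified replay does not match the recorded output, the difference you see after the override may not be caused by the override.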

The discipline

These three capabilities must be built before the platform reaches production, not after the first incident. The cost of retrofitting them is several multiples of the cost of building them in.