Error Mode Analysis
Updated 2026-04-06
The systematic review of real traces from an AI system in order to name recurring failure patterns and turn them into eval cases. This is not theoretical brainstorming, but empirical work: you inspect real outputs, annotate them, and look for repetition.
Teresa Torres developed the method while building her Interview Coach. The starting point was frustration: every prompt change fixed one problem while causing another. Without a systematic view of failure modes, it was impossible to tell whether a change made the product better or worse overall.
The Process
- Collect traces: log real user interactions, storing both inputs and LLM outputs. Teresa started with 100 interview transcripts.
- Annotate manually: a domain expert reviews the traces and marks what was good and what was not. The point is not automation. The point is to capture what a knowledgeable human actually sees.
- Search for patterns: which failures recur, and which categories appear most often?
- Decide: which failure modes can be addressed through prompt changes, and which still persist?
- Write evals: for persistent failures, add either a code-based eval or an LLM-as-judge eval. See AI Evals.
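The annotate-then-count loop above can be sketched in a few lines. The trace and label schema here is an assumption for illustration, not Teresa's actual data model:

```python
from collections import Counter

# Hypothetical annotated traces: each pairs an LLM output with the
# failure labels a human reviewer attached to it (empty list = good).
annotated_traces = [
    {"output": "...", "labels": ["leading_question"]},
    {"output": "...", "labels": []},
    {"output": "...", "labels": ["leading_question", "general_question"]},
    {"output": "...", "labels": ["already_answered"]},
]

def failure_counts(traces):
    """Tally how often each annotated failure mode recurs across traces."""
    counts = Counter()
    for trace in traces:
        counts.update(trace["labels"])
    return counts

print(failure_counts(annotated_traces).most_common())
```

The most frequent categories are the first candidates for prompt fixes; whatever survives those fixes becomes an eval.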
Concrete Failure Modes from the Interview Coach
Teresa describes several recurring categories:
Suggesting leading questions: the coach criticizes an interviewer’s question and proposes a replacement, but the replacement is itself a leading question. The model half-understands the concept without applying it consistently. Hard to catch with simple code, so best handled with LLM-as-judge.
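An LLM-as-judge eval for this failure mode might look like the sketch below. The prompt wording and the PASS/FAIL protocol are illustrative assumptions, not Teresa's actual eval; the API call itself is left out:

```python
# Hypothetical judge prompt: a second LLM grades the coach's suggestion.
JUDGE_PROMPT = """You are reviewing feedback from an interview coach.
The coach replaced the interviewer's question with this one:

{suggestion}

Does the replacement itself lead the interviewee toward a particular
answer? Reply with exactly PASS (not leading) or FAIL (leading)."""

def build_judge_prompt(suggestion: str) -> str:
    """Fill the judge prompt with the coach's suggested question."""
    return JUDGE_PROMPT.format(suggestion=suggestion)

def parse_verdict(reply: str) -> bool:
    """True means the suggestion passed, i.e. it is not leading."""
    return reply.strip().upper().startswith("PASS")
```

Constraining the judge to a fixed PASS/FAIL vocabulary keeps the eval result machine-checkable even though the judgment itself is fuzzy.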
Suggesting general questions: a similar issue, where the revised question contains words such as typically, usually, or generally, indicating a broad, generic question instead of a specific one. Because this has a visible linguistic fingerprint, a code-based eval works.
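Because the fingerprint is purely lexical, the eval can be a simple check. A minimal sketch, assuming the word list from the examples above (the real list may be longer):

```python
import re

# Words whose presence suggests a generic rather than specific question.
# The exact list is an assumption based on the examples in the text.
HEDGE_WORDS = re.compile(r"\b(typically|usually|generally)\b", re.IGNORECASE)

def is_general_question(question: str) -> bool:
    """Code-based eval: flag suggested questions with a generic fingerprint."""
    return bool(HEDGE_WORDS.search(question))

print(is_general_question("What do you usually do when the app crashes?"))   # True
print(is_general_question("What did you do last time the app crashed?"))     # False
```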
Suggesting a question that was already answered: the coach recommends a follow-up that the interviewee already addressed in the transcript. This is a context-awareness failure.

Dimension tunnel: because the Interview Coach is split into multiple LLM calls, each responsible for one interview dimension, an individual analyzer can start interpreting the entire interview through its own lens. For example, the “set the scene” analyzer starts criticizing later parts of the interview for not also setting the scene, even though that is not its job. This is an orchestration problem, not just a prompt problem.
JSON-Markdown tic: when you ask an LLM for pure structured JSON, it sometimes prepends a Markdown code fence, which makes the output invalid for a strict JSON parser. Teresa saw this in roughly 1 in 20 calls. The fix was infrastructural: the Anthropic API lets you prefill the beginning of the model's response, and forcing it to start with { reliably prevented the issue.
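A complementary parser-side guard strips a stray fence before parsing. This is a defensive sketch, not Teresa's actual fix (hers was the upstream prefill):

```python
import json
import re

# Matches a leading ```/```json fence or a trailing ``` fence.
FENCE = re.compile(r"^```(?:json)?\s*|\s*```$")

def parse_llm_json(raw: str) -> dict:
    """Strip a stray Markdown code fence, then parse the remaining JSON."""
    return json.loads(FENCE.sub("", raw.strip()))

print(parse_llm_json('```json\n{"score": 3}\n```'))  # {'score': 3}
print(parse_llm_json('{"score": 3}'))                # {'score': 3}
```

Prefilling prevents the failure; the guard makes the pipeline survive it if it slips through anyway. Belt and suspenders.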
Why Brainstorming Is Not Enough
You can try to imagine all failure modes in advance, but real user behavior produces edge cases no team can fully predict. Error Mode Analysis is the humbler and more robust alternative: test what actually goes wrong. The eval set gets sharper as the product gets used.
That is also why real traces matter so much, as early as possible. Synthetic data and golden datasets are useful for the first version, but they do not cover unknown unknowns.
Connections
- AI Evals — Error Mode Analysis is the main source of new eval cases
- Teresa Torres — applied this method directly in the Interview Coach
- Criteria Drift — a related problem where it is not the error itself that drifts but the standard used to judge it
- Product Discovery — same epistemic stance: stay open to what you did not expect instead of just confirming hypotheses
Sources
- YouTube: “AI Evals & Discovery - All Things Product with Teresa & Petra” (Teresa Torres + Petra Wille, 2025-09)