AI Evals
Updated 2026-04-08
Automated tests for AI systems. Not unit tests for code, but test cases for the behavior of a language model. Evals answer the question: “Did the model do the right thing?” For deterministic software, checking the output is often enough. For LLMs, correctness is usually a matter of judgment, and that judgment has to be formalized before it can be measured.
The term comes from ML research, where golden datasets pair inputs with expected outputs. Teresa Torres adapted the idea for LLM products after running into a classic prompt-engineering trap while building her Interview Coach: fix one error, create two new ones. Without evals, there is no reliable feedback loop.
Three Types of Evals
Dataset evals (golden datasets)
The classic ML approach: inputs and expected outputs in a table. You run the model across the full set and compare against the target outputs to get a score. Good for a first overview. Weakness: the test set only covers known scenarios, and for LLMs the “right output” is rarely one exact string.
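A minimal sketch of a dataset eval. `run_model` is a placeholder for whatever produces the model's answer, and the golden set here is invented for illustration; because one exact string is rarely the only right answer, this version compares normalized text rather than raw strings.

```python
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so near-identical answers still match."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def run_golden_eval(run_model, golden_set):
    """Score a model function against (input, expected_output) pairs.

    Returns the fraction of cases whose normalized output matches the target.
    """
    passed = sum(
        normalize(run_model(inp)) == normalize(expected)
        for inp, expected in golden_set
    )
    return passed / len(golden_set)

# Usage with a trivial stub standing in for the model:
golden = [("capital of France?", "Paris"), ("2 + 2?", "4")]
stub = {"capital of France?": "paris.", "2 + 2?": "5"}.get
print(run_golden_eval(stub, golden))  # 0.5
```

Normalized comparison softens the exact-string problem but does not solve it; semantically equivalent paraphrases still need a judge.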
Code-based evals
Rule-based, fast, reproducible. In Teresa’s Interview Coach, for example, the eval pipeline extracts all questions from the model’s answer and checks whether any of them contains words like typically, usually, or generally. If yes, fail. No AI required, no judgment required, just code. This works well when an error leaves a clear linguistic fingerprint.
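A code-based eval along those lines can be a few lines of regex. This is a sketch of the pattern, not Teresa's actual pipeline: it treats any sentence ending in `?` as a question, a deliberate simplification.

```python
import re

HEDGE_WORDS = re.compile(r"\b(typically|usually|generally)\b", re.IGNORECASE)

def extract_questions(answer: str) -> list[str]:
    # Simplification: any sentence ending in "?" counts as a question.
    return re.findall(r"[^.!?]*\?", answer)

def hedge_word_eval(answer: str) -> bool:
    """Pass (True) only if no extracted question contains a hedge word."""
    return not any(HEDGE_WORDS.search(q) for q in extract_questions(answer))

# Usage:
bad = "Tell me about your morning. What do you typically do first?"
good = "Walk me through yesterday. What did you do first?"
print(hedge_word_eval(bad), hedge_word_eval(good))  # False True
```

Deterministic, free, and fast enough to run on every trace, which is exactly why this type is preferred whenever the error has a clear fingerprint.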
LLM-as-judge
A second model evaluates the first model’s output. Example: pass the list of interview questions to a judge model and ask whether any of them is a leading question. Regex cannot solve that because it requires semantic judgment. LLM-as-judge sounds circular, but works surprisingly well if the evaluation criteria are explicit and stable. The main risk is Criteria Drift.
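An LLM-as-judge eval can be sketched like this. `judge_call` is a placeholder for your model client (one API call that takes a prompt and returns the reply text); the prompt wording and PASS/FAIL protocol are illustrative assumptions.

```python
def leading_question_judge(questions, judge_call):
    """Ask a second model whether any of the questions is leading.

    `judge_call` is a stand-in for a real model client: it takes a prompt
    string and returns the judge's raw text reply.
    """
    prompt = (
        "You are evaluating customer interview questions. A leading question "
        "suggests its own answer. Reply with exactly PASS or FAIL.\n\n"
        "FAIL if any of these questions is leading:\n"
        + "\n".join(f"- {q}" for q in questions)
    )
    verdict = judge_call(prompt).strip().upper()
    # Parse defensively: anything other than a clean PASS counts as a failure.
    return verdict == "PASS"

# Usage with a stub judge that always approves:
print(leading_question_judge(["What happened next?"], lambda p: "PASS"))  # True
```

Keeping the criteria explicit in the prompt (and versioning that prompt) is what makes the judge auditable when you later check it for Criteria Drift.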
Teresa’s Interview Coach Setup
The coach uses eight orchestrated LLM calls, each focused on a different dimension of customer interview quality, such as setting context, building a timeline, or assessing question quality. For each of those eight calls, Teresa:
- Collected real traces by running about 100 interview transcripts through the coach.
- Annotated the traces manually: what was good, what failed?
- Identified recurring failure patterns through Error Mode Analysis.
- Wrote either a code-based eval or an LLM-as-judge eval for each persistent error.
The result is an eval set per LLM call that automatically shows whether the next prompt change improves or worsens the error rate.
Evals vs. Guardrails
Evals run after the answer and measure whether an error happened. Guardrails run before the answer reaches the user and prevent the error from escaping into production. Technically, guardrails are evals executed live.
Not every eval is suitable as a guardrail. Every extra LLM call adds cost and latency, so many evals run only on a sample of traces rather than on all traffic.
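The distinction can be made concrete: the same check runs either live, blocking a bad answer before the user sees it, or offline on a sample of traces. Function names and the retry policy below are illustrative assumptions.

```python
import random

def check_answer(answer: str) -> bool:
    # Stand-in eval: pass if the answer avoids hedge words.
    return "usually" not in answer.lower()

def guarded_respond(generate, retries=2):
    """Guardrail: run the eval live and regenerate before a failure escapes."""
    for _ in range(retries + 1):
        answer = generate()
        if check_answer(answer):
            return answer
    return "Sorry, I could not produce a reliable answer."

def sampled_offline_eval(traces, sample_rate=0.1, seed=0):
    """Offline eval: score only a sample of traces to bound cost and latency."""
    rng = random.Random(seed)
    sample = [t for t in traces if rng.random() < sample_rate]
    if not sample:
        return None
    return sum(check_answer(t) for t in sample) / len(sample)
```

The guardrail path pays the eval's cost and latency on every request, which is why expensive LLM-as-judge checks usually stay in the sampled offline path.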
Evals Need Maintenance Too
Evals are not “write once and done.” Three reasons:
- New failure modes appear - more usage reveals edge cases you did not anticipate.
- Criteria Drift - judge models can drift away from human expectations without the score making that obvious.
- Model updates - when the underlying model changes, prompt tricks and assumptions may stop working.
Teresa uses the metaphor of a garden: you are never finished, you just need a maintenance practice.
Evals and Discovery Are the Same Problem
Evals are only as good as your understanding of users.
- A golden dataset represents only the scenarios you can imagine, not necessarily what users actually do.
- Human annotators judge well only when they understand what users expect.
- Synthetic data is only as realistic as the dimensions used to generate it, and those dimensions come from discovery.
So eval quality depends directly on customer understanding. That is not a coincidence. It is the same epistemic problem as Product Discovery.
Connections
- Teresa Torres - built the eval system for her Interview Coach and documented the process publicly
- Petra Wille - co-host of the episode
- Error Mode Analysis - the main method for turning real traces into eval cases
- Criteria Drift - the specific maintenance problem in LLM-as-judge systems
- Synthetic Test Data for LLMs - how to bootstrap an eval set before you have enough real data
- Product Discovery - methodologically adjacent and directly upstream
Sources
- AI Evals & Discovery - All Things Product with Teresa & Petra - Teresa Torres + Petra Wille (2025-09)
- https://www.producttalk.org/2025/09/interview-coach-evals/