The quiet divergence between automated eval criteria and what humans actually consider “good.” The eval dashboard stays green while perceived quality slips, because the human understanding of quality has moved on and the eval criteria have not.

The concept comes out of eval research and was introduced to Teresa Torres in an AI-evals course. It is one of the strongest arguments for why evals are never “write once and done.”

Why It Happens

Quality expectations are not fixed constants. What feels like a precise answer today may feel too shallow tomorrow because user expectations rise, the product matures, or the team learns more. An LLM judge that still evaluates according to old criteria has no built-in awareness of that shift.

Three concrete drift sources:

  1. Better product understanding: the team has learned what quality really means through real use, but the eval criteria were never updated.
  2. Judge-model updates: the LLM acting as judge gets updated by the provider, subtly changing its judgments.
  3. Prompt changes to the judge: small prompt edits can systematically alter decisions without making that obvious.
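
Sources 2 and 3 are easier to spot when every eval run records which judge produced it. The following is a minimal sketch, not an established practice from the source: the function name `judge_fingerprint`, the model identifier, and the prompt texts are all hypothetical and chosen for illustration.

```python
import hashlib
import json


def judge_fingerprint(model_id: str, judge_prompt: str) -> str:
    """Short, stable hash of the judge configuration (hypothetical helper)."""
    payload = json.dumps({"model": model_id, "prompt": judge_prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Illustrative values only; neither the model name nor the prompts come from the source.
before = judge_fingerprint("judge-model-v1", "Rate the answer as pass or fail.")
after = judge_fingerprint("judge-model-v1", "Rate the answer as pass or fail. Be strict.")

# A small prompt edit yields a different fingerprint, so the change shows up in eval logs.
print(before != after)
```

Storing the fingerprint next to each eval result makes a prompt edit or an explicit model switch visible. It would not catch a provider updating a model behind an unchanged identifier, and it says nothing about source 1; only comparison against human judgment covers those cases.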

The Consequence

Evals themselves need evaluation. The practical method is simple: a human and the automated eval independently review the same sample of traces. The agreement rate, the share of traces on which both reach the same verdict, becomes a measure of eval quality. If the rate drops, that is a signal of Criteria Drift.
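
The human-versus-eval comparison can be sketched in a few lines. This is an illustration under stated assumptions: verdicts are simple pass/fail booleans, and the `ReviewedTrace` type, the function names, the baseline of 0.90, and the 0.10 drop threshold are all hypothetical rather than taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedTrace:
    trace_id: str
    human_pass: bool  # verdict from the human reviewer
    judge_pass: bool  # verdict from the automated eval


def agreement_rate(traces: list[ReviewedTrace]) -> float:
    """Share of traces where human and automated eval reach the same verdict."""
    if not traces:
        raise ValueError("need at least one reviewed trace")
    matches = sum(t.human_pass == t.judge_pass for t in traces)
    return matches / len(traces)


def drift_suspected(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """Flag when agreement has fallen more than max_drop below the baseline."""
    return (baseline - current) > max_drop


# Made-up sample: human and judge disagree only on trace t2.
sample = [
    ReviewedTrace("t1", True, True),
    ReviewedTrace("t2", False, True),
    ReviewedTrace("t3", True, True),
    ReviewedTrace("t4", False, False),
]

current = agreement_rate(sample)  # 3 of 4 verdicts match
print(current, drift_suspected(baseline=0.90, current=current))
```

Raw agreement is the simplest possible measure. When almost every trace passes, it can stay high even if the judge misses most failures, so a chance-corrected statistic such as Cohen's kappa may be a better fit in that situation.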

Teresa Torres’ conclusion is that human annotations and automated evals need to be compared continuously, not just once during setup.

Connection to Discovery

At a deeper level, Criteria Drift is a discovery problem. If your picture of the customer is stale, your quality criteria become stale as well. Teams practicing continuous discovery have a natural update signal for their eval criteria: each new thing they learn about customers is a prompt to re-check what the evals count as good.

Connections

  • AI Evals — Criteria Drift is the main maintenance risk in LLM-as-judge systems
  • Error Mode Analysis — new failure modes are often an early warning sign of criteria drift
  • Teresa Torres — described the concept in the context of the Interview Coach project
  • Product Discovery — current customer understanding is the best protection against drift

Sources

  • AI Evals & Discovery - All Things Product with Teresa & Petra — Teresa Torres + Petra Wille (2025-09)