Synthetic Test Data for LLMs
Updated 2026-04-08
If you do not yet have real user interactions, for example before launch, you can use LLMs to generate realistic inputs. This solves the bootstrap problem: without eval data you cannot launch a reliable product, but without a product you cannot collect eval data.
Teresa Torres used this for her Interview Coach. She needed interview transcripts to test the coach, but real transcripts would only arrive once paying students were using it, and the coach had to be good enough from day one.
The Approach
Instead of generating transcripts blindly, you work with dimensions. First identify the variables that make an input realistic. For Teresa’s interview transcripts, those included:
- Interview length: 8 minutes in a course setting, 30 minutes, or 60 minutes, each with very different dynamics
- Interviewee type: talkative and open vs. brief and reserved
- Interview type: story-based, which the coach expects, vs. not story-based, which serves as an exclusion condition
Those dimensions are passed to the LLM as generation parameters. The result is a set of transcripts representing different combinations of those dimensions, essentially a synthetic sample.
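A minimal sketch of that cross-product step, assuming the three dimensions above (the prompt wording and dimension names are illustrative, not Teresa's actual prompts):

```python
import itertools

# Dimensions from the Interview Coach example; values are illustrative.
DIMENSIONS = {
    "length_minutes": [8, 30, 60],
    "interviewee_type": ["talkative and open", "brief and reserved"],
    "interview_type": ["story-based", "not story-based"],
}

def generation_prompts(dimensions):
    """Yield one LLM generation prompt per combination of dimension values."""
    keys = list(dimensions)
    for values in itertools.product(*(dimensions[k] for k in keys)):
        combo = dict(zip(keys, values))
        yield (
            "Generate a realistic customer interview transcript. "
            f"Length: {combo['length_minutes']} minutes. "
            f"Interviewee: {combo['interviewee_type']}. "
            f"Style: {combo['interview_type']}."
        )

prompts = list(generation_prompts(DIMENSIONS))
# 3 lengths x 2 interviewee types x 2 interview types = 12 prompts
```

Each prompt would then be sent to the LLM to produce one transcript, giving full coverage of the dimension space.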
Petra Wille adds an important point: generated data should also reflect weighting. If 80 percent of real interviews fall into one category, the eval set should reflect that rather than treating all categories as evenly common.
The Honest Limitation
Synthetic data is good enough for V0. It is not good enough for a mature production product. The reason is simple: the model generating the test data has many of the same blind spots as the model being tested. Edge cases no LLM knows about will not appear in synthetic data.
The fix is to bring real traces in as early as possible. That is why ML engineers log user inputs from the start, even before they need them. Every real trace is more valuable than a hundred synthetic ones.
Connection to Discovery
The quality of the dimensions depends directly on customer understanding. If you know your users well, you can define realistic variables. If you do not, you generate synthetic data that mirrors your assumptions rather than real user scenarios. Synthetic test data is only as good as the discovery work that came before it.
Connections
- AI Evals — synthetic data is one of the core sources for eval datasets
- Error Mode Analysis — once real traces exist, they should replace synthetic data wherever possible
- Teresa Torres — generated synthetic transcripts for the Interview Coach
- Product Discovery — prerequisite for defining meaningful dimensions
Sources
- AI Evals & Discovery - All Things Product with Teresa & Petra — Teresa Torres + Petra Wille (2025-09)