Writing evals is going to become a core skill for product managers. It is such a critical part of making a good product with AI.
- Kevin Weil, CPO of OpenAI
The boring yet crucial secret behind good system prompts is test-driven development.
- Amanda Askell, Anthropic
A prompt without the evals… is like getting a broken machine without the manual.
- Malte Ubl, CTO of Vercel
If there is one thing we can teach people, it’s that writing evals is probably the most important thing.
- Mike Krieger, CPO of Anthropic
As more products integrate generative AI, evals are emerging as a core PM skill.
AI evals are the methodical measurement of application quality for non-deterministic systems. They are used to check how well an AI-driven product meets its goals.
Why? Because AI systems are probabilistic: they don't behave the same way every time. Unlike traditional software, which is deterministic, you can't simply verify the logic you've written and assume the output will always be correct.
Without AI evals, you’re flying blind. You can’t tell whether new prompts improve things or break something behind the scenes. You can’t measure regressions. And you can’t build user trust.
Evals systematically collect test data and measure performance against clear criteria, so your agent's quality can be tracked and improved as you change the underlying prompts, data and code.
What Are AI Evals
AI Evals can be thought of as analogous to QA (quality assurance), but for probabilistic rather than deterministic systems.
- QA - Deterministic processes that ensure product quality is consistently met.
- Evals - Systematic measurement of AI application quality.
In practice, a “probabilistic” system is any software application that includes generative AI. This might include calls to LLMs, or video or image generation. In this article we’ll refer to these as “agents”.
With traditional QA you have known test cases as inputs, and you know exactly what the answer should be.
As there is no variation in the system, it’s easy to write unit, integration and regression tests which check that the outputs are expected.
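To make the contrast concrete, a traditional deterministic test might look like the short sketch below (the function and expected values are hypothetical):

```python
# A deterministic check: the same input always gives the same output,
# so a plain equality assertion is all you need.
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # The expected answer is known exactly in advance.
    assert apply_discount(100.0, 20) == 80.0
```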
However, once you add generative AI to a product to make it an agent, it will give different outputs for the same input. That makes writing tests with simple pass/fail logic much more difficult: multiple different outputs could all deliver a good user experience.
The answer here is to develop evals which test whether the output is acceptable across a number of tightly defined dimensions. These might include measures such as:
- Does the agent return valid JSON that matches this schema?
- Does the answer include profanity?
- For a vegetarian meal plan, does the model avoid meat/fish?
- Does the output sound formal enough for a business context?
Once your AI evals are defined, many test cases can be assessed, and the percentage pass/fail rate for each dimension can be calculated. This gives a benchmark against which future versions of the agent can be measured.
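As a rough sketch of how this looks in practice, imagine a meal-planning agent: each dimension becomes a narrow pass/fail check, and the pass rate per dimension is your benchmark. The check functions and field names below are illustrative, and a real suite would add LLM-judged dimensions alongside the code-based ones.

```python
# Sketch: run many test-case outputs through narrow pass/fail checks
# and report a pass rate per dimension.
import json
from typing import Callable

def valid_json_schema(output: str) -> bool:
    """Code-based check: does the output parse as JSON with the fields we expect?"""
    try:
        return {"meal", "ingredients"} <= json.loads(output).keys()
    except (json.JSONDecodeError, AttributeError):
        return False

def no_meat_or_fish(output: str) -> bool:
    """Naive keyword check for the vegetarian constraint (a real eval might use an LLM judge)."""
    banned = ["chicken", "beef", "pork", "salmon", "tuna"]
    return not any(word in output.lower() for word in banned)

CHECKS: dict[str, Callable[[str], bool]] = {
    "valid_json": valid_json_schema,
    "vegetarian_respected": no_meat_or_fish,
}

def run_evals(outputs: list[str]) -> dict[str, float]:
    """Return the pass rate for each dimension across all test cases."""
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in CHECKS.items()
    }
```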
Why AI Evals Matter
AI Evals are important for two main reasons:
1. Fast iteration and quality
Automated tests transformed how software teams ship. AI Evals do the same for agents. They expose bugs in behaviour, help you fix them quickly, and show whether each change actually improved things.
2. Detecting non-obvious issues
Some problems never show up in simple metrics. You only catch them by looking at real user traces. That's where you find strange edge cases, small hallucinations, and places where the agent sounds confident but breaks a user constraint - for example, a recipe labelled "gluten-free" that still introduces allergens.
When You Need AI Evals
AI products cover a broad range of functionality, and not all of them require evals. Whether yours does comes down to a simple cost/benefit analysis.
Evals are expensive to build and to run: they take significant time to set up, and running them relies on making API calls to LLMs.
This level of investment doesn’t make sense unless you have:
- Low and variable quality - you're not getting good results without evals. Results vary significantly, and often they aren't good enough for users. This is usually the result of having multiple AI steps in your agent, and/or AI steps with weak prompts and guidelines.
- Quality affects business outcomes - your system is running at scale and impacting business metrics like revenue and cost. A custom GPT that only you use is an AI product, but the impact of it failing is so low that it's not worth writing evals for.
If your product is simple and deterministic - “if this, then that” - then simple checks or traditional QA is enough. There will be little unpredictability in the system.
However, as soon as your product has:
- Multiple AI steps
- Branching workflows
- Tool use
- LLM judgment calls
AND it is also being used at scale either internally or externally, then we would consider it an “agent” and AI evals become essential.
Levels of AI Eval Maturity
Of course, the term "evals" covers a spectrum of capabilities and thoroughness for testing agents.
Broadly you can think of three different levels of sophistication here:
- Vibe checks – Quick, informal checks using pinned prompts and 1–2 example cases. Useful while you’re still shaping the agent’s behaviour and just want to spot obvious flaws.
- Light evals – Small, curated set of test cases that help you validate key use cases consistently without heavy setup.
- Metric-based evals – Large-scale test suites built from live or synthetic data, producing quantified performance metrics. Essential for benchmarking, iteration, and scaling with confidence.
If you are building a simple agentic workflow then vibe or light evals will often be more than sufficient. It’s only when you are building proper agents that metric-based evals become necessary.
If you’re developing a new agent from scratch then you’ll often develop your evals as your agent itself develops:
- While you’re shaping an internal MVP, vibe checks will help you get going.
- As you get closer to a public MVP then you will implement light evals.
- As soon as you go live and start to get real user feedback then it makes sense to work on metric-based evals.
Three Types of AI Evals
If we think about the ways that we can run tests on agents, there are three categories:
- Code-based tests
- LLM-as-judge tests
- A/B and other quantitative tests
Only code-based and LLM-as-judge tests are really evals that help you improve product quality.
A/B and other quantitative tests are in a way even more important - they help you understand whether your product is delivering the business impact that you want. However, they don't give you any insight into whether your product is working as intended. We won't go into these in more detail in this article.
Let’s run through each of these types of tests in a bit more detail.
1. Code-based tests
These are checks you write with objective, rules-based code. They're fast, cheap and deterministic, so you should use these sorts of tests wherever possible.
Examples:
* Does the output match a JSON schema?
* Does it avoid leaking PII?
* Is the structure or format correct?
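For instance, the PII and format checks might be sketched like this; the regexes and the ten-bullet limit are simple placeholders rather than production rules:

```python
# Sketch of two code-based checks: a simple PII screen and a format check.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def leaks_pii(output: str) -> bool:
    """Flag outputs containing anything that resembles an email address or phone number."""
    return bool(EMAIL_RE.search(output) or PHONE_RE.search(output))

def is_correct_format(output: str) -> bool:
    """Check the response is a non-empty bulleted list of at most ten lines."""
    lines = [line for line in output.splitlines() if line.strip()]
    return 0 < len(lines) <= 10 and all(line.lstrip().startswith("-") for line in lines)
```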
2. LLM-as-judge evals
These use another model to judge the output. You define a single, narrow failure mode for your product and check whether the output meets that criterion. These sorts of evals can be used to test against complex or subjective criteria.
Each eval here should be narrow, binary, and calibrated against human judgment.
Examples:
* Is the tone of response appropriate?
* Is the agent accurate?
* Did the agent stick to the conditions the human user set?
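A minimal sketch of the third example, using the OpenAI Python SDK as the judge (the model name, prompt wording and PASS/FAIL convention are all choices you would calibrate against human judgment, not fixed rules):

```python
# Sketch of an LLM-as-judge eval: one narrow, binary question per eval.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's reply.
Question: did the reply stay within the constraints the user set
(e.g. budget, dietary requirements, tone)? Answer with exactly PASS or FAIL.

User request:
{request}

Assistant reply:
{reply}
"""

def judge_constraint_adherence(request: str, reply: str) -> bool:
    """Return True if the judge model says the reply respected the user's constraints."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for whatever you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(request=request, reply=reply)}],
        temperature=0,
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")
```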
3. A/B tests
These tests measure outcomes, not outputs. These sorts of tests tell you whether you’re seeing the business impact that you want, not whether the model responded correctly.
Examples:
* Did average session length increase?
* Did D30 user retention increase?
* Did conversion rate to paid plan increase?
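Although they're out of scope for the rest of this article, a quick sketch shows how simple the analysis can be: comparing conversion between a control group and a new prompt variant with a two-proportion z-test (the counts below are invented):

```python
# Sketch: two-sided p-value for the difference in conversion rates
# between a control group (A) and a variant (B).
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# e.g. 480/12,000 conversions on control vs 540/12,000 on the new prompt
p_value = two_proportion_p_value(480, 12_000, 540, 12_000)
```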
How to Actually Run AI Evals
Having discussed the types of AI evals and the levels of maturity you can build them to, let's look at how you would actually create and use evals for a real agent. This guide covers metric-based AI evals of the kind used on a complex agent with real users.
Developing and using AI evals at this level is a four-step process:
- Collect traces - gathering test cases to assess the performance of your product.
- Run error analysis - identifying and categorising the types of errors that appear.
- Design evals - designing specific tests to check the errors that you’ve identified.
- Integrate evals into workflow - measuring the performance of your product against these evals as you make changes.
We’ll run through each of these stages in turn.
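Before diving in, it may help to picture the raw material these steps work on. A collected trace might be stored as something like the record below; the field names are illustrative, and in practice traces usually come from your logging or observability tooling.

```python
# Sketch of a single collected trace, ready for error analysis.
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    user_input: str
    agent_output: str
    tool_calls: list[dict] = field(default_factory=list)
    # Filled in by a human reviewer during error analysis:
    error_labels: list[str] = field(default_factory=list)
    notes: str = ""
```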