Writing evals is going to become a core skill for product managers. It is such a critical part of making a good product with AI.
- Kevin Weil, CPO of OpenAI
The boring yet crucial secret behind good system prompts is test-driven development.
- Amanda Askell, Anthropic
A prompt without the evals… is like getting a broken machine without the manual.
- Malte Ubl, CTO of Vercel
If there is one thing we can teach people, it’s that writing evals is probably the most important thing.
- Mike Krieger, CPO of Anthropic
As more products integrate generative AI, evals are emerging as a core PM skill.
AI evals are the methodical measurement of application quality for non-deterministic systems. They are used to check how well an AI-driven product meets its goals.
Why? Because AI systems are probabilistic: they don't behave the same way every time. Unlike traditional software, which is deterministic, you can't simply verify the logic you've written and assume the output will always be correct.
Without AI evals, you’re flying blind. You can’t tell whether new prompts improve things or break something behind the scenes. You can’t measure regressions. And you can’t build user trust.
Evals systematically collect test data and measure performance against clear criteria, so your agent's quality can be tracked and improved as you change the underlying prompts, data and code.
What Are AI Evals
AI Evals can be thought of as analogous to QA (quality assurance), but for probabilistic rather than deterministic systems.
- QA - Deterministic processes that ensure product quality is consistently met.
- Evals - Systematic measurement of AI application quality.
In practice, a “probabilistic” system is any software application that includes generative AI. This might include calls to LLMs, or video or image generation. In this article we’ll refer to these as “agents”.
With traditional QA you have known test cases as inputs, and you know exactly what the answer should be.
As there is no variation in the system, it’s easy to write unit, integration and regression tests which check that the outputs are expected.
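To make the contrast concrete, a traditional deterministic test might look like the short sketch below (the function and expected values are hypothetical):

```python
# A deterministic check: the same input always gives the same output,
# so a plain equality assertion is all you need.
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # The expected answer is known exactly in advance.
    assert apply_discount(100.0, 20) == 80.0
```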
However, once you add generative AI to a product to make it an agent, it will give different outputs for the same input. That makes writing tests with simple pass/fail logic much more difficult: multiple different outputs could all deliver a good user experience.
The answer here is to develop evals which test whether the output is acceptable across a number of tightly defined dimensions. These might include measures such as:
- Does the agent return valid JSON that matches this schema?
- Does the answer include profanity?
- For a vegetarian meal plan, does the model avoid meat/fish?
- Does the output sound formal enough for a business context?
Once your AI evals are defined, many test cases can be assessed, and the percentage pass/fail rate for each dimension can be calculated. This gives a benchmark against which future versions of the agent can be measured.
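As a rough sketch of how this looks in practice, imagine a meal-planning agent: each dimension becomes a narrow pass/fail check, and the pass rate per dimension is your benchmark. The check functions and field names below are illustrative, and a real suite would add LLM-judged dimensions alongside the code-based ones.

```python
# Sketch: run many test-case outputs through narrow pass/fail checks
# and report a pass rate per dimension.
import json
from typing import Callable

def valid_json_schema(output: str) -> bool:
    """Code-based check: does the output parse as JSON with the fields we expect?"""
    try:
        return {"meal", "ingredients"} <= json.loads(output).keys()
    except (json.JSONDecodeError, AttributeError):
        return False

def no_meat_or_fish(output: str) -> bool:
    """Naive keyword check for the vegetarian constraint (a real eval might use an LLM judge)."""
    banned = ["chicken", "beef", "pork", "salmon", "tuna"]
    return not any(word in output.lower() for word in banned)

CHECKS: dict[str, Callable[[str], bool]] = {
    "valid_json": valid_json_schema,
    "vegetarian_respected": no_meat_or_fish,
}

def run_evals(outputs: list[str]) -> dict[str, float]:
    """Return the pass rate for each dimension across all test cases."""
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in CHECKS.items()
    }
```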
Why AI Evals Matter
AI Evals are important for two main reasons:
1. Fast iteration and quality
Automated tests transformed how software teams ship. AI Evals do the same for agents. They expose bugs in behaviour, help you fix them quickly, and show whether each change actually improved things.
2. Detecting non-obvious issues
Some problems never show up in simple metrics. You only catch them by looking at real user traces. That's where you find strange edge cases, small hallucinations, and places where the agent sounds confident but breaks a user constraint - for example, a recipe labelled "gluten-free" that still introduces allergens.
When You Need AI Evals
AI products cover a broad range of functionality, and not all of them require evals. Whether yours does comes down to a simple cost/benefit analysis.
Evals are expensive to build and to run: they take significant time to set up, and running them relies on making API calls to LLMs.
This level of investment doesn’t make sense unless you have:
- Low and variable quality - you're not getting good results without evals. Results vary significantly, and often they aren't good enough for users. This is usually the result of having multiple AI steps in your agent, and/or AI steps with weak prompts and guidelines.
- Quality affects business outcomes - your system is running at scale and impacting business metrics like revenue and cost. A custom GPT that only you use is an AI product, but the impact of it failing is so low that it's not worth writing evals for.
If your product is simple and deterministic - “if this, then that” - then simple checks or traditional QA is enough. There will be little unpredictability in the system.
However, as soon as your product has:
- Multiple AI steps
- Branching workflows
- Tool use
- LLM judgment calls
AND it is also being used at scale either internally or externally, then we would consider it an “agent” and AI evals become essential.
Levels of AI Eval Maturity
Of course, the term "evals" covers a spectrum of capabilities and thoroughness for testing agents.
Broadly you can think of three different levels of sophistication here:
- Vibe checks – Quick, informal checks using pinned prompts and 1–2 example cases. Useful while you’re still shaping the agent’s behaviour and just want to spot obvious flaws.
- Light evals – Small, curated set of test cases that help you validate key use cases consistently without heavy setup.
- Metric-based evals – Large-scale test suites built from live or synthetic data, producing quantified performance metrics. Essential for benchmarking, iteration, and scaling with confidence.
If you are building a simple agentic workflow then vibe or light evals will often be more than sufficient. It’s only when you are building proper agents that metric-based evals become necessary.
If you’re developing a new agent from scratch then you’ll often develop your evals as your agent itself develops:
- While you’re shaping an internal MVP, vibe checks will help you get going.
- As you get closer to a public MVP then you will implement light evals.
- As soon as you go live and start to get real user feedback then it makes sense to work on metric-based evals.
Three Types of AI Evals
If we think about the ways that we can run tests on agents, there are three categories:
- Code-based tests
- LLM-as-judge tests
- A/B and other quantitative tests
Only code-based and LLM-as-judge tests are really evals that help you improve product quality.
A/B and other quantitative tests are in a way even more important - they help you understand whether your product is delivering the business impact that you want. However, they don't give you any insight into whether your product is working as intended. We won't go into these in more detail in this article.
Let’s run through each of these types of tests in a bit more detail.
1. Code-based tests
These are checks you write with objective, rules-based code. They're fast, cheap and deterministic, so you should use these sorts of tests wherever possible.
Examples:
* Does the output match a JSON schema?
* Does it avoid leaking PII?
* Is the structure or format correct?
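For instance, the PII and format checks might be sketched like this; the regexes and the ten-bullet limit are simple placeholders rather than production rules:

```python
# Sketch of two code-based checks: a simple PII screen and a format check.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def leaks_pii(output: str) -> bool:
    """Flag outputs containing anything that resembles an email address or phone number."""
    return bool(EMAIL_RE.search(output) or PHONE_RE.search(output))

def is_correct_format(output: str) -> bool:
    """Check the response is a non-empty bulleted list of at most ten lines."""
    lines = [line for line in output.splitlines() if line.strip()]
    return 0 < len(lines) <= 10 and all(line.lstrip().startswith("-") for line in lines)
```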
2. LLM-as-judge evals
These use another model to judge the output. You define a single, narrow failure mode for your product and check whether the output meets that criterion. These sorts of evals can be used to test against complex or subjective criteria.
Each eval here should be narrow, binary, and calibrated against human judgment.
Examples:
* Is the tone of response appropriate?
* Is the agent accurate?
* Did the agent stick to the conditions the human user set?
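A minimal sketch of the third example, using the OpenAI Python SDK as the judge (the model name, prompt wording and PASS/FAIL convention are all choices you would calibrate against human judgment, not fixed rules):

```python
# Sketch of an LLM-as-judge eval: one narrow, binary question per eval.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's reply.
Question: did the reply stay within the constraints the user set
(e.g. budget, dietary requirements, tone)? Answer with exactly PASS or FAIL.

User request:
{request}

Assistant reply:
{reply}
"""

def judge_constraint_adherence(request: str, reply: str) -> bool:
    """Return True if the judge model says the reply respected the user's constraints."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for whatever you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(request=request, reply=reply)}],
        temperature=0,
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")
```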
3. A/B tests
These tests measure outcomes, not outputs. These sorts of tests tell you whether you’re seeing the business impact that you want, not whether the model responded correctly.
Examples:
* Did average session length increase?
* Did D30 user retention increase?
* Did conversion rate to paid plan increase?
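Although they're out of scope for the rest of this article, a quick sketch shows how simple the analysis can be: comparing conversion between a control group and a new prompt variant with a two-proportion z-test (the counts below are invented):

```python
# Sketch: two-sided p-value for the difference in conversion rates
# between a control group (A) and a variant (B).
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# e.g. 480/12,000 conversions on control vs 540/12,000 on the new prompt
p_value = two_proportion_p_value(480, 12_000, 540, 12_000)
```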
How to Actually Run AI Evals
Having discussed the types of AI evals and the levels of maturity you can build them to, let's look at how you would actually create and use evals for a real agent. This guide covers metric-based AI evals of the kind used on a complex agent with real users.
Developing and using AI evals at this level is a four-step process:
- Collect traces - gathering test cases to assess the performance of your product.
- Run error analysis - identifying and categorising the types of errors that appear.
- Design evals - designing specific tests to check the errors that you’ve identified.
- Integrate evals into workflow - measuring the performance of your product against these evals as you make changes.
We’ll run through each of these stages in turn.
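Before diving in, it may help to picture the raw material these steps work on. A collected trace might be stored as something like the record below; the field names are illustrative, and in practice traces usually come from your logging or observability tooling.

```python
# Sketch of a single collected trace, ready for error analysis.
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    user_input: str
    agent_output: str
    tool_calls: list[dict] = field(default_factory=list)
    # Filled in by a human reviewer during error analysis:
    error_labels: list[str] = field(default_factory=list)
    notes: str = ""
```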