Andrew Zellinger

YOU ALWAYS LET YOURSELF WIN

[AUTHOR]

Andrew Zellinger

[DATE]

MAY 16th 2026

Designers need evals, not just prompts. Prompting gets you a result; evals decide whether that result is worth shipping.

Every AI product team hits the same moment. The demo works. The model says something coherent. The prototype feels alive in a way software didn't a few years ago. People lean forward. Someone says, "This is impressive."

[Then the real question arrives: is it good?]

Not "did it respond," or "did the API return something," or "did the prompt work once while the team was watching." Good the way product people mean it — useful, clear, trustworthy, fit for the user, honest about uncertainty, able to recover when it's wrong, and consistent enough that the product feels designed rather than merely generated.

In too many teams, the answer is still: a senior person looks at it.

[That is not a quality system. That is a bottleneck with taste]

It's also a quiet way of playing chess against yourself. And against yourself, you always let yourself win.

Designers have spent years making quality more visible. We turned messy product intent into journey maps, principles, design systems, content guidelines, accessibility checks, critique rituals, research plans, and launch reviews. We learned to make judgment discussable. Now AI has moved much of product behavior into places designers rarely inspect: prompts, system instructions, retrieval logic, tool calls, model defaults, memory, guardrails, ranking rules, agent policies, and fallback states. The interface is no longer the whole experience. It is the surface where a deeper system shows itself. So designers need a new habit: stop treating the prompt as the main design artifact, and start treating evals as one.

[What an eval really is]

In engineering and machine learning, an eval is a test that measures whether a model performs well on a defined task. That sounds technical, formal, and distant from design practice, but the underlying idea is simple: an eval is a repeatable way to judge whether an AI system behaves according to your standards. That is design work.

Designers evaluate constantly. We judge whether a flow makes sense, whether a message is clear, whether a visual hierarchy supports the user's task, whether an edge case breaks trust, whether an interaction asks too much of someone at the wrong moment. The difference is that AI products don't give us one fixed flow to inspect. They give us a range of possible behaviors. So the design question changes from "Does this screen work?" to "Across many situations, does this system behave in ways we'd be proud to ship?" That is what evals are for.

[Prompts are not enough]

Prompts are seductive because they feel like control. Write the right instruction, add the right examples, tighten the tone, tell the model what to avoid, tell it to think step by step, tell it to act like your best researcher or editor or support agent. Sometimes that works. Often it works just enough to make you overconfident. A prompt is a production input. It helps the system make something. An eval is a quality instrument. It helps the team decide whether the thing being made is acceptable over time. Without evals, teams mistake a good output for a good system. That is dangerous, because AI quality is not a single-state problem. The same product can be thoughtful in one conversation, evasive in another, overconfident in a third, and quietly harmful in a fourth. It can handle common use cases well while failing the unusual situations that matter most. Prompts shape behavior. Evals reveal it. You need both.

[Why designers should care]

There is a version of evals that belongs deeply to engineering. Latency, cost, task completion, retrieval accuracy, benchmark performance, regression testing, jailbreak resistance, and infrastructure reliability all matter. Designers don't need to own those. But there is another layer, how the system behaves as an experience, that design is uniquely qualified to lead.

Does the system understand what the user is actually asking for?

Does it know when to ask a clarifying question?

Does it express uncertainty in a way that helps rather than irritates?

Does it feel capable without pretending to be omniscient?

Does it handle vulnerable, high-stakes, or emotionally loaded moments with appropriate care?

Does its output fit the product's point of view?

These are not "tone" questions. They are product quality questions. If designers don't help define them, someone else will. Usually by accident.

[The designer's eval stack]

A practical design eval doesn't need to start as a complex platform. It can begin as a structured document and a weekly ritual. The stack has five parts:

1. Behavioral criteria. Start by naming the qualities the system must preserve. Not vague qualities like "good" or "human" — specific standards. For example:

It should answer the user's actual intent, not just the literal wording.
It should ask for missing information before making risky assumptions.
It should distinguish confidence from uncertainty.
It should use the product's language, not generic assistant language.

These become the team's shared quality bar.

2. Scenario set. AI products need to be tested against situations, not just happy paths. A good scenario set includes:

Common tasks users perform every day.
Ambiguous requests where intent is incomplete.
Edge cases where the system is likely to over-assume.
High-friction moments where users are confused or frustrated.
Boundary cases where the system should refuse, redirect, or escalate.
Accessibility and inclusion cases where phrasing, assumptions, or defaults may exclude people.

The scenario set is where research becomes operational. Every confusing support ticket, failed usability session, sales objection, and edge-case interview can become an eval scenario.
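One way to keep a scenario set honest is to store it as structured data and check which categories are still empty. The sketch below is only an illustration: the `Scenario` fields, the short category names, and the sample inputs are assumptions made for this example, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass

# Short names for the six scenario types listed above (naming is hypothetical).
CATEGORIES = {
    "common", "ambiguous", "edge_case",
    "high_friction", "boundary", "accessibility",
}


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    category: str
    user_input: str
    source: str  # where it came from, e.g. a support ticket or usability session

    def __post_init__(self):
        # Reject typos early so coverage counts stay trustworthy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def missing_categories(scenarios):
    """Return the categories that have no scenario yet, sorted by name."""
    covered = Counter(s.category for s in scenarios)
    return sorted(c for c in CATEGORIES if covered[c] == 0)


# Invented sample data for illustration only.
scenarios = [
    Scenario("s1", "common", "Summarize this week's feedback.", "usage logs"),
    Scenario("s2", "ambiguous", "Can you fix it?", "support ticket"),
    Scenario("s3", "boundary", "Delete every account in the workspace.", "edge-case interview"),
]

print(missing_categories(scenarios))
# → ['accessibility', 'edge_case', 'high_friction']
```

A spreadsheet with the same four columns would serve just as well. The point is that gaps in coverage become visible instead of assumed.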

3. Examples and anti-examples. Designers are good at pattern recognition, but teams need more than vibes. For each scenario, collect examples of:

Strong responses.
Acceptable responses.
Weak responses.
Unshippable responses.

The anti-examples matter most. They teach the system and the team what failure looks like — the move you'd skip if you were only playing your own side. This is where taste becomes concrete. Instead of "this doesn't feel right," you can say: "This response is overconfident, skips the user's constraint, and gives no recovery path." That is a far more useful critique.

4. Review cadence. Evals only matter if they're used repeatedly. Set a rhythm:

Weekly: review a small set of critical scenarios.
Before launch: run the full scenario set.
After model or prompt changes: compare old and new behavior.
After incidents: add new failure cases to the eval set.

The goal isn't ceremony. It's to keep quality from depending on memory, heroics, or whoever happened to be in the room.
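The "compare old and new behavior" step can be as small as diffing two sets of scores. The following is a minimal sketch, assuming each scenario already has a single 1-5 score from a review; the function name, the scenario ids, and the numbers are all hypothetical.

```python
def find_regressions(old_scores, new_scores, tolerance=0):
    """Return (scenario_id, old, new) for scores that dropped by more than
    `tolerance`, with the largest drop first."""
    drops = []
    for scenario_id, old in old_scores.items():
        new = new_scores.get(scenario_id)
        if new is None:
            continue  # scenario was not re-run, so there is nothing to compare
        if old - new > tolerance:
            drops.append((scenario_id, old, new))
    # new - old is most negative for the biggest drop, so it sorts first.
    return sorted(drops, key=lambda d: d[2] - d[1])


# Invented scores before and after a prompt change.
old = {"refund-request": 4, "vague-question": 3, "angry-user": 5}
new = {"refund-request": 4, "vague-question": 4, "angry-user": 2}

print(find_regressions(old, new))
# → [('angry-user', 5, 2)]
```

Here the change improved one scenario and badly hurt another, which is exactly the kind of trade a single impressive demo would hide.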

5. Ownership and escalation. Every criterion needs an owner. Some belong to design, some to research, some to content, some to policy, legal, data science, engineering, or support. The important thing is that failures have a path. If the model repeatedly misunderstands user intent, who investigates? If the tone is technically compliant but brand-damaging, who decides? If the product gives a correct answer in a way users don't trust, who owns the fix? Without ownership, evals become a graveyard of observations.
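Giving failures a path can start as a simple lookup from criterion to owner. This sketch is an assumption-heavy illustration: the `OWNERS` mapping, the criterion names, and the team names are placeholders, and a real team would decide its own assignments.

```python
# Hypothetical mapping of criteria to owning teams.
OWNERS = {
    "intent": "research",
    "tone": "content",
    "trust": "design",
}
DEFAULT_OWNER = "unassigned"


def route_failures(failures):
    """Group failure notes by owning team so every observation has a path."""
    queue = {}
    for criterion, note in failures:
        owner = OWNERS.get(criterion, DEFAULT_OWNER)
        queue.setdefault(owner, []).append(note)
    return queue


# Invented failure notes from a review session.
failures = [
    ("intent", "Misread 'cancel' as 'pause' in three scenarios."),
    ("tone", "Compliant but cold reply to a frustrated user."),
    ("latency", "Response took 12 seconds."),
]

for owner, notes in route_failures(failures).items():
    print(owner, len(notes))
```

Anything that lands under `unassigned` is a signal that a criterion has no owner yet, which is the gap this step is meant to close.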

[A simple AI UX eval rubric]

Earlier I listed a few behavioral criteria as examples. A working rubric expands them into something a team can score together. Each criterion below is paired with what to look for and is scored 1-5.

Intent fidelity: Does the system respond to what the user actually means, not just the surface wording?
Usefulness: Does the response help the user make progress?
Uncertainty handling: Does it show confidence, uncertainty, and limits appropriately?
Assumption control: Does it avoid inventing context or overfilling gaps?
Tone and posture: Does it sound appropriate for the product, moment, and user need?
Recovery: If the output is imperfect, is there a clear way to correct, refine, undo, or escalate?
Inclusion and accessibility: Does it avoid exclusionary assumptions and support different user needs?
Boundary behavior: Does it refuse, redirect, or ask for help when it should?
Product point of view: Does the behavior reflect what this product believes good help looks like?

Use the score to start a conversation, not end one. The notes matter more than the number.
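A rubric like this can be recorded in a small structure that keeps each score next to its note. The sketch below is one possible shape, not a standard format: the `Review` class, the snake_case criterion names, the threshold of 3, and the sample scores are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean

# The nine rubric criteria, written as identifiers (naming is hypothetical).
CRITERIA = [
    "intent_fidelity", "usefulness", "uncertainty_handling",
    "assumption_control", "tone_and_posture", "recovery",
    "inclusion_and_accessibility", "boundary_behavior",
    "product_point_of_view",
]


@dataclass
class Review:
    scenario_id: str
    scores: dict                               # criterion -> score from 1 to 5
    notes: dict = field(default_factory=dict)  # criterion -> reviewer note

    def validate(self):
        """Raise ValueError if a criterion is missing or out of the 1-5 range."""
        for name in CRITERIA:
            score = self.scores.get(name)
            if score is None or not 1 <= score <= 5:
                raise ValueError(f"{name}: expected a score from 1 to 5, got {score!r}")

    def average(self):
        return mean(self.scores[name] for name in CRITERIA)

    def weakest(self, threshold=3):
        """Return (criterion, score, note) for scores below threshold, lowest first."""
        flagged = [
            (name, self.scores[name], self.notes.get(name, ""))
            for name in CRITERIA
            if self.scores[name] < threshold
        ]
        return sorted(flagged, key=lambda item: item[1])


# Invented review of one scenario.
scores = {name: 4 for name in CRITERIA}
scores["recovery"] = 2
scores["uncertainty_handling"] = 1
review = Review(
    "vague-question",
    scores,
    notes={
        "recovery": "No way to refine the answer.",
        "uncertainty_handling": "States a guess as fact.",
    },
)
review.validate()

for criterion, score, note in review.weakest():
    print(criterion, score, note)
```

The average is available, but the `weakest` list is the more useful output, since it carries the notes that explain why a score is low.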

[The weekly ritual]

If I were adding this to a design team's operating rhythm, I'd start with a 45-minute weekly review.

Five minutes: choose three scenarios.
Ten minutes: run the current product, prompt, or prototype against them.
Fifteen minutes: score the outputs using the rubric.
Ten minutes: identify the highest-risk failure pattern.
Five minutes: assign one change to make before the next review.

That is enough to begin. The important part isn't the meeting. It's the muscle. Designers need to practice looking at AI behavior as a material — not as magic, not as a demo, not as a mysterious property of the model, but as something that can be shaped, reviewed, compared, and improved.

[What changes when designers lead evals]

When designers help define evals, the team stops asking only whether the AI can do the task. It starts asking better questions. What kind of help are we trying to provide? What should the system never do, even if the user asks? Where should it be opinionated, and where should it be humble? Where should it slow down? Where should it hand control back to the user? Where would a technically correct answer still feel wrong? These questions are not decoration. They are the product. The companies that treat evals as only technical infrastructure will measure what machines can count. The companies that bring design into evals will measure what users actually experience.

[The quality bar has to move upstream]

AI makes production faster. That is the obvious part. The less obvious part is that it also makes mediocrity faster. It can generate more screens, more copy, more flows, more summaries: more plausible-looking work. Without evals, teams ship whatever looks impressive in the shortest demo. With evals, teams make their standards visible before the system scales. That is the work now: not better prompts, but better ways to know whether the product is behaving well. Anyone can play themselves and win. Evals are how you find out whether you would have won against a real opponent.
