A simulation runs a synthetic customer through your AI agent and verifies the outcome against explicit checks. You define how the conversation starts and what a correct response looks like. The simulation runs the agent and reports pass or fail. Each simulation stays saved and re-runnable, so it doubles as a regression test: run it again after any change to confirm nothing broke. Simulations sit alongside the faster, throwaway methods in Testing AI Agents. Use them when a flow is risky or high-volume, or when you want to lock in correct behavior so future edits can’t regress it.

When to use simulations

Simulations are worth the setup when there is a concrete outcome to assert, such as a specific reply, a tool call, a completed procedure, or an escalation. They pay off most for:
  • Risky flows that write data or have monetary impact, like cancellations and refunds, where you want every branch covered before deploying.
  • High-volume topics where a small quality gain removes a large amount of work.
  • Regression protection on behavior you have already fixed once and don’t want to break again.
For a quick one-off check while editing, the manual Guidance tab test chat or View Alternative is faster. A good habit is to create a simulation directly from a real conversation that went wrong, so it covers that case from then on.

Create a simulation

Open Simulations and add a simulation to a group. Groups organize simulations by topic or audience, such as Subscriptions or Billing.
Field | Purpose
Name | Describes the scenario, for example Cancel subscription with refund.
Group | The group the simulation belongs to.
Start message | The first message the synthetic customer sends, which kicks off the run.
Situation context | Background the synthetic customer knows but won’t necessarily state upfront, used to drive realistic follow-ups.
Checks | The assertions that decide pass or fail. See Checks.
Tool overrides | Mocked tool outputs so the agent never calls real systems. See Tool overrides.
Context overrides | Override runtime context such as channel, customer user, or time.
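
As a mental model only, the fields above can be pictured as a small record. The sketch below is not the product's storage format or API: the class name, attribute names, check dictionaries, and example values such as cancel_subscription are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Simulation:
    """Hypothetical in-memory model of the simulation fields listed above."""
    name: str
    group: str
    start_message: str
    situation_context: str = ""
    checks: list[dict[str, Any]] = field(default_factory=list)
    tool_overrides: dict[str, Any] = field(default_factory=dict)
    context_overrides: dict[str, Any] = field(default_factory=dict)


# Example values are invented; only the field names mirror the documentation.
cancel_with_refund = Simulation(
    name="Cancel subscription with refund",
    group="Subscriptions",
    start_message="I want to cancel my subscription.",
    situation_context="Customer was charged yesterday and expects a refund.",
    checks=[
        {"type": "tool_used", "tool": "cancel_subscription"},
        {"type": "ai_replied",
         "condition": "The agent confirmed the cancellation and offered a refund."},
    ],
    tool_overrides={"cancel_subscription": {"status": "cancelled"}},
    context_overrides={"channel": "email"},
)
```

Writing a scenario out this way before entering it can help you spot a missing check or override early.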

Checks

Checks are the assertions evaluated at the end of a run. A simulation passes only when all its checks pass. Add as many as you need.
Check | Passes when
Procedure finished | A specific procedure runs to completion during the simulation.
Tool used | The agent invokes a tool with the given name at least once.
AI replied | A freeform, LLM-judged condition holds at the end of the run, for example “The agent confirmed the cancellation and offered a refund.”
Escalated | The agent escalates the conversation to a human.
For AI replied checks, the run shows the model’s reasoning for why the condition passed or failed, which helps you tighten the wording when a judgment looks off.
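
The pass rule (every check must hold) can be sketched as a list of predicates over a finished run. This is an illustrative model rather than the product's implementation: RunRecord, the helper names, and the keyword-based judge standing in for the LLM are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunRecord:
    """Hypothetical summary of what happened during one simulation run."""
    finished_procedures: set[str] = field(default_factory=set)
    tools_called: list[str] = field(default_factory=list)
    escalated: bool = False
    transcript: str = ""


Check = Callable[[RunRecord], bool]


def procedure_finished(name: str) -> Check:
    return lambda run: name in run.finished_procedures


def tool_used(name: str) -> Check:
    # Passes if the tool was invoked at least once.
    return lambda run: name in run.tools_called


def escalated() -> Check:
    return lambda run: run.escalated


def ai_replied(condition: str, judge: Callable[[str, str], bool]) -> Check:
    # `judge` stands in for the LLM that evaluates the freeform condition.
    return lambda run: judge(condition, run.transcript)


def simulation_passes(run: RunRecord, checks: list[Check]) -> bool:
    # A simulation passes only when all of its checks pass.
    return all(check(run) for check in checks)


run = RunRecord(
    finished_procedures={"cancel_subscription"},
    tools_called=["cancel_subscription", "issue_refund"],
    transcript="Your subscription is cancelled and a refund is on its way.",
)
# Toy judge: a keyword match, far cruder than an LLM judgment.
keyword_judge = lambda condition, transcript: "refund" in transcript.lower()

checks = [
    procedure_finished("cancel_subscription"),
    tool_used("issue_refund"),
    ai_replied("The agent confirmed the cancellation and offered a refund.", keyword_judge),
]
result = simulation_passes(run, checks)
```

Appending escalated() to this list would make the same run fail, because the agent never handed off to a human.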

Tool overrides

Tool overrides return a predetermined output whenever the agent calls a given tool, so simulations stay reproducible and never touch live systems. Use them to:
  • Avoid real side effects, such as actually cancelling a subscription.
  • Feed specific data, such as a particular refund amount or contract state.
  • Test failure handling by mocking an error and confirming the agent escalates instead of falsely confirming success.
You define reusable example outputs (a name, the tool name, and a string or JSON output) and select which ones apply to a simulation. Only one override is active per tool name.
Pair an error tool override with an Escalated check to prove the agent fails safe. If a write tool errors, the agent should escalate rather than tell the customer the action succeeded.
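
The fail-safe pattern can be illustrated with a toy registry keyed by tool name, so that only one override is active per tool. Everything here is hypothetical: the ToolOverrides class, the handle_cancellation agent step, the error payload shape, and the choice to raise when a tool has no override are assumptions of this sketch, not documented product behavior.

```python
from typing import Any


class ToolOverrides:
    """Hypothetical registry of mocked tool outputs, one per tool name."""

    def __init__(self) -> None:
        self._active: dict[str, Any] = {}

    def activate(self, tool_name: str, output: Any) -> None:
        # Keyed by tool name, so a later override replaces an earlier one.
        self._active[tool_name] = output

    def call(self, tool_name: str, **arguments: Any) -> Any:
        # Arguments are ignored: the output is predetermined.
        if tool_name not in self._active:
            # Assumption of this sketch: never fall through to a live system.
            raise LookupError(f"no override for {tool_name!r}")
        return self._active[tool_name]


def handle_cancellation(tools: ToolOverrides) -> str:
    """Toy agent step: escalate on a tool error instead of confirming success."""
    result = tools.call("cancel_subscription", customer_id="c_123")
    if isinstance(result, dict) and "error" in result:
        return "escalated"
    return "confirmed"


overrides = ToolOverrides()
overrides.activate("cancel_subscription", {"status": "cancelled"})
happy_path = handle_cancellation(overrides)

# Swap in an error output for the same tool to exercise failure handling.
overrides.activate("cancel_subscription", {"error": "billing service unavailable"})
error_path = handle_cancellation(overrides)
```

In a real simulation, the second case corresponds to selecting the error override and adding an Escalated check.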

Run simulations and read results

Run a single simulation from its row, or select several and run them as a batch against your current Production deployment. The queue moves each run from Scheduled to Running, and then to one of three final states: Passed, Failed, or Errored. An Errored run means the agent couldn’t complete the scenario end to end, which is itself a signal worth investigating. Open a run to see the full detail:
  • Each check with a pass or fail icon, plus the reasoning for AI-judged checks.
  • The conversation transcript between the synthetic customer and the agent, including tool calls.
  • The agent version the run executed against, labeled Production or Immutable.
  • A history strip of recent runs so you can see when behavior changed.
If you edit a simulation after a run, the detail view flags that the run is out of date, so you know to re-run before trusting it.
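
The run lifecycle described above can be modeled as a small state machine. The status names come from this page; the transition table, the advance helper, and the timestamp comparison used to flag an out-of-date run are assumptions, since the page does not say how the product tracks either.

```python
from datetime import datetime
from enum import Enum


class RunStatus(Enum):
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    PASSED = "Passed"
    FAILED = "Failed"
    ERRORED = "Errored"


# Assumed transitions: Scheduled -> Running -> one final state.
ALLOWED = {
    RunStatus.SCHEDULED: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.PASSED, RunStatus.FAILED, RunStatus.ERRORED},
}


def advance(current: RunStatus, nxt: RunStatus) -> RunStatus:
    """Return the next status, rejecting moves the lifecycle does not allow."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"cannot move from {current.value} to {nxt.value}")
    return nxt


def is_out_of_date(run_started_at: datetime, simulation_edited_at: datetime) -> bool:
    # Hypothetical rule: a run is stale if the simulation changed after it started.
    return simulation_edited_at > run_started_at


status = advance(RunStatus.SCHEDULED, RunStatus.RUNNING)
status = advance(status, RunStatus.ERRORED)
stale = is_out_of_date(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0))
```

Passed, Failed, and Errored have no outgoing transitions in this model, which matches the idea that re-running creates a new run rather than reviving an old one.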

Build a suite for risky changes

For a risky flow like cancellations, don’t rely on a single manual test. Write one simulation per branch of the procedure (standard case, customer with a second request, ineligible customer, tool error) and group them together. After any future edit, run the whole group with one click and deploy only when every case passes. This replaces repeated manual click-through testing without reducing safety, and it scales as the flow grows more complex.
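
The rule "deploy only when every case passes" amounts to a simple gate over the group's results. The sketch below assumes you have the final status of each simulation as a plain mapping; the function names and the example simulation names are hypothetical.

```python
def failing_cases(results: dict[str, str]) -> list[str]:
    """Names of simulations whose final status is anything other than Passed."""
    return sorted(name for name, status in results.items() if status != "Passed")


def safe_to_deploy(results: dict[str, str]) -> bool:
    # An empty group proves nothing, so it does not count as safe.
    return bool(results) and not failing_cases(results)


# One simulation per branch of a cancellation procedure (example names).
cancellation_group = {
    "Standard cancellation": "Passed",
    "Cancellation with second request": "Passed",
    "Ineligible customer": "Failed",
    "Tool error escalates": "Passed",
}

blocked_by = failing_cases(cancellation_group)
ok = safe_to_deploy(cancellation_group)
```

Errored runs are treated like failures here, on the reasoning that a scenario the agent could not complete has not been verified.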

Next steps

  • Procedures: Model the multi-step flows your simulations verify.
  • Testing AI Agents: Compare simulations with live testing and View Alternative.