Evals

Graded, repeatable tests — scripted conversations scored by an LLM judge — so you can trust a change before it ships.

The test panel lets you talk to the agent yourself. Evals let you do that repeatably and at scale: each eval is a scripted conversation that is played against the agent and graded by a language-model judge. They are how you move from "it worked when I tried it" to "it works on every call".

How an eval works

An eval scenario plays a scripted conversation against your agent, and an LLM judge grades the result. The judge decides whether the agent did what the scenario required: handled the caller's intent, called the right tools, and brought the call to a satisfactory end.
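
To make the flow concrete, here is a minimal sketch of the play-then-judge loop. Every name in it (Scenario, Transcript, run_scenario, stub_agent, stub_judge, the create_booking tool) is hypothetical and not the platform's API, and the judge is a simple rule-based stand-in for what would really be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only; not the platform's API.

@dataclass
class Scenario:
    name: str
    caller_lines: list[str]      # the scripted caller side of the conversation
    required_tools: set[str]     # tools the agent is expected to call

@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)
    tools_called: set[str] = field(default_factory=set)

# An agent maps one caller line to (reply, names of tools it called).
Agent = Callable[[str], tuple[str, set[str]]]
# A judge maps (scenario, transcript) to pass/fail.
Judge = Callable[[Scenario, Transcript], bool]

def run_scenario(scenario: Scenario, agent: Agent, judge: Judge) -> bool:
    """Play the scripted caller lines against the agent, then grade the result."""
    transcript = Transcript()
    for line in scenario.caller_lines:
        reply, tools = agent(line)
        transcript.turns.append(("caller", line))
        transcript.turns.append(("agent", reply))
        transcript.tools_called |= tools
    return judge(scenario, transcript)

def stub_agent(line: str) -> tuple[str, set[str]]:
    # Toy agent: books whenever the caller mentions booking.
    if "book" in line.lower():
        return "Booked you in for Tuesday.", {"create_booking"}
    return "How can I help?", set()

def stub_judge(scenario: Scenario, transcript: Transcript) -> bool:
    # Stand-in for the LLM judge: only checks that the required tools were called.
    return scenario.required_tools <= transcript.tools_called

scenario = Scenario(
    name="simple booking",
    caller_lines=["Hi there", "I'd like to book a table for Tuesday"],
    required_tools={"create_booking"},
)
passed = run_scenario(scenario, stub_agent, stub_judge)
```

A real judge would also weigh intent handling and how the call ended, which a subset check on tool names cannot capture; the sketch only shows where the judge sits in the loop.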

Why you need them

A single green run proves very little. Voice agents are non-deterministic: the same prompt can pass on one run and fail on the next. Evals give you:

Repeatability — run the same scenario again after a change and compare.
Breadth — cover the awkward cases (caller changes their mind, gives a bad detail, asks something out of scope) that you would not think to retry by hand every time.
Confidence — trust a result that holds across several runs, never a single pass.
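
One way to act on the last point is to run a scenario several times and accept it only if every run passes. The sketch below assumes a hypothetical run_once callable that plays the scenario once and returns the judge's verdict; the default of five runs is an arbitrary illustration, not a platform setting.

```python
from typing import Callable

def passes_consistently(run_once: Callable[[], bool], runs: int = 5) -> tuple[bool, float]:
    """Run a scenario `runs` times; return (all runs passed, pass rate)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results = [run_once() for _ in range(runs)]
    pass_rate = sum(results) / runs
    return all(results), pass_rate

# A flaky scenario simulated with a fixed sequence of verdicts.
verdicts = iter([True, True, False, True])
stable, rate = passes_consistently(lambda: next(verdicts), runs=4)
# stable is False and rate is 0.75: one failure in four runs is enough to reject.
```

Requiring all runs to pass is the strictest choice; a threshold on the pass rate is a reasonable alternative when some flakiness is tolerable.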

Treat evals as the gate: write a failing scenario for the behaviour you want before you change the prompt or a tool, then make it pass.

Reading the results

Beyond pass/fail, the eval dashboards surface the voice-quality metrics that decide whether a call feels right:

Latency and time-to-first-byte — how quickly the agent starts responding.
Interruptions and double-speak — the agent and caller talking over each other.
Filler rate — how often the agent uses filler phrases to mask a wait.

These turn "the call felt off" into a number you can chase.
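
As a rough illustration of how such numbers can fall out of a call, the sketch below derives three of them from a list of timestamped turns. The Turn structure, the filler phrase list, and the metric definitions are all assumptions made for the example; the dashboards may define and compute these differently.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "caller" or "agent"
    start: float   # seconds from the start of the call
    end: float
    text: str

# Hypothetical filler phrases; the real list is not documented here.
FILLERS = ("one moment", "let me check", "bear with me")

def summarize(turns: list[Turn]) -> dict[str, float]:
    latencies = []   # gap between caller finishing and agent starting
    overlaps = 0     # turns that start before the other speaker has finished
    for prev, cur in zip(turns, turns[1:]):
        if cur.speaker != prev.speaker and cur.start < prev.end:
            overlaps += 1
        elif prev.speaker == "caller" and cur.speaker == "agent":
            latencies.append(cur.start - prev.end)

    agent_turns = [t for t in turns if t.speaker == "agent"]
    filler_turns = [t for t in agent_turns if any(f in t.text.lower() for f in FILLERS)]

    return {
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "overlaps": overlaps,
        "filler_rate": len(filler_turns) / len(agent_turns) if agent_turns else 0.0,
    }

call = [
    Turn("caller", 0.0, 2.0, "I'd like to book a table."),
    Turn("agent", 2.5, 4.0, "One moment, let me check."),
    Turn("caller", 3.5, 5.0, "Actually, make it Wednesday."),
    Turn("agent", 6.0, 7.0, "Sure, Wednesday it is."),
]
metrics = summarize(call)
```

In this made-up call the agent answers after 0.5 s and 1.0 s, the caller cuts in once, and one of the two agent turns contains a filler phrase.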

Billing

Eval scenarios are billed under the same rule as live calls: $0.65 for each scenario that resolves and passes the judge. Scenarios that fail or never resolve are free. See Billing and credits for the full table.
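
The rule is simple enough to estimate by hand, and a small sketch makes it explicit. The outcome labels ("passed", "failed", "unresolved") and the function name are hypothetical; only the $0.65 price and the pass-only rule come from this page.

```python
from decimal import Decimal

PRICE_PER_PASSING_SCENARIO = Decimal("0.65")  # from the billing rule above

def eval_cost(outcomes: list[str]) -> Decimal:
    """Cost of an eval run: only scenarios that resolved and passed are billed."""
    billable = sum(1 for outcome in outcomes if outcome == "passed")
    return billable * PRICE_PER_PASSING_SCENARIO

# 20 scenarios: 14 pass, 4 fail, 2 never resolve.
run = ["passed"] * 14 + ["failed"] * 4 + ["unresolved"] * 2
cost = eval_cost(run)  # 14 x 0.65 = 9.10
```

Decimal is used instead of float so that currency amounts stay exact.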
