Evals
Graded, repeatable tests — scripted conversations scored by an LLM judge — so you can trust a change before it ships.
The test panel lets you talk to the agent yourself. Evals let you do that repeatably and at scale: each eval is a scripted conversation that gets played against the agent and graded by a language-model judge. They are how you move from "it worked when I tried it" to "it works on every call".
How an eval works
When a scenario runs, its scripted conversation is played against your agent and the LLM judge grades the result. The judge decides whether the agent did what the scenario required: it handled the caller's intent, called the right tools, and reached a satisfactory end.
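As a rough illustration, a scenario can be thought of as a scripted set of caller turns plus what the judge should look for. The sketch below is not the product's schema: `Scenario`, `Verdict`, `toy_judge`, and the tool names are all hypothetical. `toy_judge` is a deterministic stand-in that checks only tool calls and resolution, whereas the real judge is a language model that also weighs the caller's intent.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical shape of one eval scenario (not the product's schema)."""
    name: str
    caller_turns: list[str]
    expected_tools: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    passed: bool
    reasons: list[str]


def toy_judge(scenario: Scenario, tools_called: list[str], resolved: bool) -> Verdict:
    """Deterministic stand-in for the LLM judge: checks tools and resolution only."""
    reasons = []
    missing = [t for t in scenario.expected_tools if t not in tools_called]
    if missing:
        reasons.append(f"missing tool calls: {', '.join(missing)}")
    if not resolved:
        reasons.append("conversation did not reach a satisfactory end")
    return Verdict(passed=not reasons, reasons=reasons)


# Tool names here are made up for the example.
scenario = Scenario(
    name="caller changes their mind",
    caller_turns=["I'd like to book for two.", "Actually, make it four."],
    expected_tools=["check_availability", "create_booking"],
)
verdict = toy_judge(scenario, tools_called=["check_availability"], resolved=True)
print(verdict.passed, verdict.reasons)
```

Returning reasons alongside the pass/fail flag mirrors how a failing verdict is only useful if it says what was missed.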
Why you need them
A single green run proves nothing. Voice agents are non-deterministic; the same prompt can pass once and fail the next time. Evals give you:
- Repeatability — run the same scenario again after a change and compare.
- Breadth — cover the awkward cases (caller changes their mind, gives a bad detail, asks something out of scope) that you would not think to retry by hand every time.
- Confidence — trust a result that holds across several runs, never a single pass.
Treat evals as the gate: write a failing scenario for the behaviour you want before you change the prompt or a tool, then make it pass.
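To make "holds across several runs" concrete, here is a minimal sketch of a pass-rate gate. It assumes you can obtain one boolean verdict per run; `pass_rate`, `gate`, and the default of five runs are illustrative choices, not product settings.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int = 5) -> float:
    """Run the same scenario several times and return the fraction that passed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


def gate(rate: float, threshold: float = 1.0) -> bool:
    """Accept a change only if the pass rate meets the threshold."""
    return rate >= threshold


# Canned verdicts standing in for five real runs of one scenario.
canned = iter([True, False, True, True, True])
rate = pass_rate(lambda: next(canned), runs=5)
print(rate, gate(rate))
```

With one failure in five runs the gate stays closed, which is the point: a scenario that passes four times out of five would have looked green on most single tries.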
Reading the results
Beyond pass/fail, the eval dashboards surface the voice-quality metrics that decide whether a call feels right:
- Latency and time-to-first-byte — how quickly the agent starts responding.
- Interruptions and double-speak — the agent and caller talking over each other.
- Filler rate — how often the agent uses filler phrases to mask a wait.
These turn "the call felt off" into a number you can chase.
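For intuition about what such numbers measure, the sketch below derives similar figures from a list of timestamped turns. The `Turn` record, the `FILLERS` phrases, and the formulas are assumptions made for illustration; the dashboards' exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "caller" or "agent"
    start: float  # seconds from call start
    end: float
    text: str


# Assumed filler phrases; a real list would be tuned to the agent.
FILLERS = ("one moment", "let me check", "bear with me")


def summarize(turns: list[Turn]) -> dict[str, float]:
    latencies = []
    overlaps = 0
    for prev, cur in zip(turns, turns[1:]):
        # Overlap: the next speaker starts before the previous one finished.
        if cur.speaker != prev.speaker and cur.start < prev.end:
            overlaps += 1
        # Latency: gap between the caller finishing and the agent starting.
        if prev.speaker == "caller" and cur.speaker == "agent":
            latencies.append(max(0.0, cur.start - prev.end))
    agent_turns = [t for t in turns if t.speaker == "agent"]
    filler_turns = [t for t in agent_turns if any(f in t.text.lower() for f in FILLERS)]
    return {
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "overlaps": overlaps,
        "filler_rate": len(filler_turns) / len(agent_turns) if agent_turns else 0.0,
    }


turns = [
    Turn("caller", 0.0, 2.0, "I'd like to book a table."),
    Turn("agent", 2.5, 5.0, "One moment, let me check availability."),
    Turn("caller", 4.5, 6.0, "Actually, make it for four."),
    Turn("agent", 7.0, 9.0, "Done, a table for four is booked."),
]
stats = summarize(turns)
print(stats)
```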
Billing
Eval scenarios bill on the same rule as live calls: $0.65 per scenario that resolves and passes the judge. Scenarios that fail or never resolve are free. See Billing and credits for the full table.
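As a worked example of that rule, the sketch below totals the cost of a hypothetical batch. The function name and the `(resolved, passed)` outcome format are assumptions for illustration; only the $0.65 price and the "resolves and passes" condition come from the rule above.

```python
PRICE_CENTS = 65  # $0.65 per billable scenario, from the rule above


def eval_cost_cents(outcomes: list[tuple[bool, bool]]) -> int:
    """Each outcome is (resolved, passed); only resolved-and-passed ones bill."""
    billable = sum(1 for resolved, passed in outcomes if resolved and passed)
    return billable * PRICE_CENTS


# Hypothetical batch: 7 pass, 2 resolve but fail, 1 never resolves.
outcomes = [(True, True)] * 7 + [(True, False)] * 2 + [(False, False)]
cost = eval_cost_cents(outcomes)
print(f"${cost / 100:.2f}")  # 7 x $0.65 = $4.55
```

Working in whole cents avoids floating-point rounding when many scenarios are summed.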