Grounding validation for AI agents. Zero LLM calls.
Tool agents: 98.6% precision. RAG pipelines: transparent coverage with semantic fallback.
The problem
Whether your agent calls an API or retrieves documents — the source of truth is right there in the trace.
Stozer checks if the answer actually matches it.
Tool output:

```json
{
  "name": "Emily Carter",
  "balance": 2450,
  "status": "active"
}
```

Agent response: "Emily's balance is $2,540 and her account is active."

Retrieved context: "...the refund policy allows returns within 30 days of purchase. Items must be in original packaging. Gift cards are non-refundable..."

Agent response: "You can return items within 60 days. All purchases are eligible for refund."
How it works
Drop in a few lines of code, get a verdict in milliseconds.

1. Send the agent trace — tool calls, retrieved context, and the final response.
2. 50+ deterministic rules check every claim against the ground truth. No LLM. Zero API calls.
3. Get a pass/fail verdict with exact failure reasons. Block, warn, or log — you choose.
```ts
const trace = new TraceBuilder()
  .addToolCall('getUser', { id: 'U-42' })
  .addToolOutput('getUser', {
    name: 'Emily Carter',
    balance: 2450,
    status: 'active'
  })
  .addFinalResponse("Emily's balance is $2,450.")
  .build();

const { report } = await stozer.evaluate(trace);
// score: 1.0 — all claims grounded ✓
```
```ts
const trace = new TraceBuilder()
  .addContextChunks([{
    text: 'Refund policy: returns within '
        + '30 days. Gift cards non-refundable.'
  }])
  .addFinalResponse('Returns accepted within 60 days.')
  .build();

const { report } = await stozer.evaluate(trace);
// score: 0.0 — numeric mismatch 30→60
// coverage: 85% deterministic
```
Structured JSON — APIs, databases, functions
Free-text docs — catches fabrications, reports coverage
Detection
Every claim in the response is extracted, matched to source data, and verified — deterministic rules on structured data, with optional semantic fallback for free-text.
- **Numeric drift** — Prices, dates, quantities, percentages — any number that drifts from the source data.
- **Entity confusion** — Wrong name, wrong company, wrong product. Cross-contamination between records.
- **Unsupported claims** — Statements with no basis in retrieved context. Fabricated policies, invented features.
- **Status errors** — Order marked "shipped" when it's "processing". Account shown "active" when suspended.
- **Omissions** — Omitted disclaimers, dropped conditions, ignored caveats from the source material.
- **Stale data** — Outdated information presented as current. Wrong dates, expired offers, stale data.
Structured values get strong deterministic detection; free-text claims fall back to a semantic NLI model. Strength depends on data structure, not configuration.
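To make the deterministic idea concrete, here is a toy numeric-grounding rule — an illustration only, not Stozer's actual rule set (which spans 50+ rules): extract every number claimed in the response and require that each one appears among the values in the source data.

```typescript
// Toy numeric-grounding rule, illustrative only: a claim's numbers
// must literally appear in the structured source data.
function extractNumbers(text: string): number[] {
  // Match digits with optional thousands separators and decimals.
  return (text.match(/\d[\d,]*(?:\.\d+)?/g) ?? []).map((s) =>
    Number(s.replace(/,/g, ""))
  );
}

function numericClaimsGrounded(
  response: string,
  source: Record<string, unknown>
): boolean {
  const sourceNumbers = new Set(
    Object.values(source).filter((v): v is number => typeof v === "number")
  );
  return extractNumbers(response).every((n) => sourceNumbers.has(n));
}

const source = { name: "Emily Carter", balance: 2450 };
console.log(numericClaimsGrounded("Emily's balance is $2,450.", source)); // true
console.log(numericClaimsGrounded("Emily's balance is $2,540.", source)); // false
```

Because the rule is a pure function of the trace, the same input always yields the same verdict — the property the comparison table below relies on.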
Why Stozer
LLM-as-a-judge is slow, expensive, and can hallucinate itself.
Stozer uses deterministic rules.
| | LLM-as-a-Judge | Stozer |
|---|---|---|
| Speed | 200–2,000 ms | <50 ms |
| Cost per eval | $0.01–$0.05 | $0 |
| Determinism | Non-deterministic | Same input → same output |
| Can hallucinate | Yes | No — rules don't hallucinate |
| Setup | Prompt engineering | Works out of the box |
| Explainability | Black-box score | Exact failure reasons |
| Accuracy on tool agents | F1 85–95% | F1 97.8% (production) |
For RAG pipelines (free-text documents), LLM judges may score higher on paraphrased content — Stozer excels at structured data and transparently reports its deterministic coverage on free text.
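The coverage idea can be sketched in a few lines — the claim shape and field names below are hypothetical, not Stozer's report format: coverage is simply the share of extracted claims that deterministic rules verified, with the remainder handed to the semantic fallback.

```typescript
// Hypothetical claim record: each claim is verified either by a
// deterministic rule or by the semantic (NLI) fallback.
type Claim = { text: string; verifiedBy: "deterministic" | "semantic" };

// Deterministic coverage = fraction of claims the rules could verify.
function deterministicCoverage(claims: Claim[]): number {
  if (claims.length === 0) return 1;
  const det = claims.filter((c) => c.verifiedBy === "deterministic").length;
  return det / claims.length;
}

const claims: Claim[] = [
  { text: "returns within 30 days", verifiedBy: "deterministic" },
  { text: "items must be in original packaging", verifiedBy: "semantic" },
];
console.log(deterministicCoverage(claims)); // 0.5
```

A high coverage number tells you most of the verdict rests on deterministic rules; a low one tells you how much weight the NLI fallback carried.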
Adoption
Five modes let you adopt incrementally:

- Explore historical traces
- Fail builds on regressions
- Silent production monitor
- Alert teams on failures
- Stop bad responses
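One way to wire a verdict into these modes — the mode names and dispatcher below are illustrative, not Stozer's API — is a small router that decides what a failing score triggers:

```typescript
// Hypothetical dispatcher mirroring the five adoption modes above.
// Mode names and the score threshold are assumptions, not stozer-ai API.
type Mode = "explore" | "ci" | "monitor" | "alert" | "gate";
type Verdict = { score: number; reasons: string[] };

function handle(mode: Mode, verdict: Verdict): string {
  const failed = verdict.score < 1.0;
  switch (mode) {
    case "explore": return "logged for analysis";
    case "ci":      return failed ? "build failed" : "build passed";
    case "monitor": return failed ? "recorded silently" : "ok";
    case "alert":   return failed ? "team notified" : "ok";
    case "gate":    return failed ? "response blocked" : "response delivered";
  }
}

console.log(handle("gate", { score: 0.0, reasons: ["numeric mismatch 30→60"] }));
// → "response blocked"
```

Starting in `explore` or `monitor` and graduating to `gate` is what makes the adoption incremental: the verdict is the same, only the action changes.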
Where it works best
The more structured the source data, the higher the precision. Same engine, same rules — the data determines the result.
AI agents that call APIs, query databases, or invoke functions and return structured JSON. Every value is typed and verifiable. Deterministic checks are near-perfect here.
AI agents that retrieve and summarize documents, PDFs, or knowledge base articles. Catches fabricated facts and numbers. Paraphrases are harder — Stozer reports a coverage metric showing what it verified deterministically vs. semantically.
Many real-world agents use both — tool calls for data, RAG for context. Stozer handles the full trace.
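A mixed trace simply carries both kinds of evidence. The sketch below uses the builder pattern from the examples above, with a minimal stand-in implementation so it runs standalone — the real `TraceBuilder` ships in `stozer-ai`, and the internal trace shape shown here is an assumption, not the actual format:

```typescript
// Minimal stand-in for the TraceBuilder pattern, for illustration only.
// The trace shape below is assumed; the real builder is in stozer-ai.
type Trace = {
  toolCalls: { name: string; args: unknown; output?: unknown }[];
  contextChunks: { text: string }[];
  finalResponse?: string;
};

class TraceBuilder {
  private trace: Trace = { toolCalls: [], contextChunks: [] };
  addToolCall(name: string, args: unknown) {
    this.trace.toolCalls.push({ name, args });
    return this;
  }
  addToolOutput(name: string, output: unknown) {
    const call = this.trace.toolCalls.find((c) => c.name === name);
    if (call) call.output = output;
    return this;
  }
  addContextChunks(chunks: { text: string }[]) {
    this.trace.contextChunks.push(...chunks);
    return this;
  }
  addFinalResponse(text: string) {
    this.trace.finalResponse = text;
    return this;
  }
  build(): Trace {
    return this.trace;
  }
}

// One trace, two evidence sources: a structured tool result plus
// a retrieved free-text context chunk.
const trace = new TraceBuilder()
  .addToolCall("getUser", { id: "U-42" })
  .addToolOutput("getUser", { name: "Emily Carter", balance: 2450 })
  .addContextChunks([{ text: "Refund policy: returns within 30 days." }])
  .addFinalResponse("Emily has $2,450 and can return items within 30 days.")
  .build();

console.log(trace.toolCalls.length, trace.contextChunks.length); // 1 1
```

Claims backed by the tool output get deterministic checks; claims backed by the context chunk fall under the coverage-reported free-text path.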
11 languages — EN, SR, ES, FR, PT, DE, IT, RU, HI, AR, BN
Benchmarks
Reproducible results on public and production datasets, split by data type.
Read the full benchmark report.

- HaluEval QA: 16,662 samples. Near-zero false positives on structured API/database outputs.
- FaithBench: 750 expert-annotated summaries. Catches fabricated data in documents; paraphrases are harder.
- Production set: 1,500+ manually verified traces from HR, finance, and operations — predominantly tool-calling agents with structured API/database outputs.
All benchmarks reproducible. Same engine, same rules — accuracy depends on data structure, not configuration.
Stozer is in early access. The npm package is live; the hosted platform is coming soon.

Start now: `npm install stozer-ai`