Early access — npm package is live

Your AI agent is making things up.
Stozer catches it in 50ms.

Grounding validation for AI agents. Zero LLM calls. Tool agents: 98.6% precision. RAG pipelines: transparent coverage with semantic fallback.

Get Early Access
npm install stozer-ai
TypeScript · Python (soon) · OpenAI · Anthropic · Gemini

The problem

AI doesn't crash.
It confidently returns wrong answers.

Whether your agent calls an API or retrieves documents — the source of truth is right there in the trace.
Stozer checks if the answer actually matches it.

TOOL AGENT
TOOL OUTPUT — source of truth
{
  "name": "Emily Carter",
  "balance": 2450,
  "status": "active"
}
AGENT RESPONSE

"Emily's balance is $2,540 and her account is active."

FAIL numeric_mismatch — expected 2450, got 2540
RAG PIPELINE
RETRIEVED DOCUMENT — source of truth

"...the refund policy allows returns within 30 days of purchase. Items must be in original packaging. Gift cards are non-refundable..."

AGENT RESPONSE

"You can return items within 60 days. All purchases are eligible for refund."

FAIL numeric_mismatch — expected 30 days, got 60
FAIL context_contradiction — gift cards excluded

How it works

Three steps. Zero infrastructure.

Drop in a few lines of code, get a verdict in milliseconds

1

Capture

Send the agent trace — tool calls, retrieved context, and the final response.

2

Validate

50+ deterministic rules check every claim against the ground truth. No LLM. Zero API calls.

3

Act

Get a pass/fail verdict with exact failure reasons. Block, warn, or log — you choose.

tool-agent.ts
const trace = new TraceBuilder()
  .addToolCall('getUser', { id: 'U-42' })
  .addToolOutput('getUser', {
    name: 'Emily Carter',
    balance: 2450,
    status: 'active'
  })
  .addFinalResponse(
    "Emily's balance is $2,450."
  ).build();

const { report } = await stozer.evaluate(trace);
// score: 1.0 — all claims grounded ✓
rag-pipeline.ts
const trace = new TraceBuilder()
  .addContextChunks([{
    text: 'Refund policy: returns within '
      + '30 days. Gift cards non-refundable.'
  }])
  .addFinalResponse(
    'Returns accepted within 60 days.'
  ).build();

const { report } = await stozer.evaluate(trace);
// score: 0.0 — numeric mismatch 30→60
// coverage: 85% deterministic
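The Act step can be sketched as a small verdict router. This is a minimal sketch that assumes a report shape with `score` and `failures` fields — check the stozer-ai docs for the real field names:

```typescript
// Hypothetical report shape, for illustration only.
type Report = {
  score: number;                                  // 1.0 = all claims grounded
  failures: { rule: string; detail: string }[];
};

type Action = 'block' | 'warn' | 'pass';

// Route on the verdict: hard-fail on numeric mismatches,
// warn below a score threshold, pass otherwise.
function decide(report: Report, threshold = 0.8): Action {
  if (report.failures.some(f => f.rule === 'numeric_mismatch')) {
    return 'block';
  }
  return report.score >= threshold ? 'pass' : 'warn';
}
```

Wire something like this after `stozer.evaluate(trace)`: block the response, surface a warning, or just log, depending on the returned action.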
TOOL AGENTS
98.6%
Precision

Structured JSON — APIs, databases, functions

RAG PIPELINES
88%
Recall

Free-text docs — catches fabrications, reports coverage

<50ms
Latency
0
LLM Calls
50+
Detection Rules

Detection

Six types of grounding failures.
Caught without a single LLM call.

Every claim in the response is extracted, matched to source data, and verified — deterministic rules on structured data, with optional semantic fallback for free-text.
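As an illustration only (not Stozer's actual implementation), a deterministic numeric-mismatch check can be as simple as extracting every number from the response and verifying it appears in the source:

```typescript
// Pull all numbers out of a string, tolerating thousands separators
// like "2,540". Matches integers and decimals.
function extractNumbers(text: string): number[] {
  const normalized = text.replace(/(\d),(?=\d{3}\b)/g, '$1');
  const matches = normalized.match(/\d+(?:\.\d+)?/g);
  return matches ? matches.map(Number) : [];
}

// Return every number in the response that has no match in the source.
function numericMismatches(response: string, source: string): number[] {
  const allowed = new Set(extractNumbers(source));
  return extractNumbers(response).filter(n => !allowed.has(n));
}
```

Against the example above, `numericMismatches("Emily's balance is $2,540", '{"balance": 2450}')` flags 2540 — no model involved, same answer every time.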

Numeric Mismatches

Prices, dates, quantities, percentages — any number that drifts from the source data.

Tool ◆◆◆ RAG ◆◆◆

Entity Substitution

Wrong name, wrong company, wrong product. Cross-contamination between records.

Tool ◆◆◆ RAG ◆◆

Unsupported Claims

Statements with no basis in retrieved context. Fabricated policies, invented features.

Tool ◆◆◆ RAG ◆◇

Status & State Errors

Order marked "shipped" when it's "processing". Account shown "active" when suspended.

Tool ◆◆◆ RAG ◆◇

Missing Qualifications

Omitted disclaimers, dropped conditions, ignored caveats from the source material.

Tool ◆◆ RAG ◆◆◆

Temporal Errors

Outdated information presented as current. Wrong dates, expired offers, stale data.

Tool ◆◆◆ RAG ◆◆

◆ = strong detection    ◇ = semantic fallback (NLI model)    Strength depends on data structure, not configuration.

Why Stozer

You don't need AI
to check AI.

LLM-as-a-judge is slow, expensive, and can itself hallucinate.
Stozer uses deterministic rules.

                          LLM-as-a-Judge        Stozer
Speed                     200–2,000 ms          <50 ms
Cost per eval             $0.01–$0.05           $0
Determinism               Non-deterministic     Same input → same output
Can hallucinate           Yes                   No — rules don't hallucinate
Setup                     Prompt engineering    Works out of the box
Explainability            Black box score       Exact failure reasons
Accuracy on tool agents   F1 85–95%             F1 97.4% (production)

For RAG pipelines (free-text documents), LLM judges may score higher on paraphrased content — Stozer excels at structured data and transparently reports its deterministic coverage on free text.

Adoption

Start with debug.
Graduate to blocking.

Five modes let you adopt incrementally.

Debug

Explore historical traces

CI/CD

Fail builds on regressions

Observe

Silent production monitor

Warn

Alert teams on failures

Block

Stop bad responses

Where it works best

Accuracy depends on
data structure, not configuration.

The more structured the source data, the higher the precision. Same engine, same rules — the data determines the result.

Tool-Calling Agents
STRUCTURED DATA — 98.6% PRECISION

AI agents that call APIs, query databases, or invoke functions and return structured JSON. Every value is typed and verifiable. Deterministic checks are near-perfect here.

Data sources
REST / GraphQL APIs SQL / NoSQL queries CRM records ERP / HRIS data Payment systems Function calling
Industries
Finance HR / Payroll E-commerce Insurance Healthcare Logistics Telecom Manufacturing
RAG Pipelines
FREE-TEXT DOCUMENTS — 88% RECALL

AI agents that retrieve and summarize documents, PDFs, or knowledge base articles. Catches fabricated facts and numbers. Paraphrases are harder — Stozer reports a coverage metric showing what it verified deterministically vs. semantically.

Data sources
PDF / Word documents Knowledge bases Wiki / Confluence Policy documents Support articles Contracts
Industries
Legal Government Education Support / Help desk Compliance Retail

Many real-world agents use both — tool calls for data, RAG for context. Stozer handles the full trace.

11 languages — EN, SR, ES, FR, PT, DE, IT, RU, HI, AR, BN

Benchmarks

Don't take our word for it.
Check the benchmarks.

Reproducible results on public and production datasets, split by data type.

Read the full benchmark report
Tool Agents — structured JSON data | RAG Pipelines — free-text documents
TOOL VERIFICATION
94.8% F1
Precision 96.4% Recall 93.3%

HaluEval QA: 16,662 samples. Near-zero false positives on structured API/database outputs.

API agents Database queries Tool calling
RAG VERIFICATION
68.9% F1
Precision 61.3% Recall 78.6%

FaithBench: 750 expert-annotated summaries. Catches fabricated data in documents; paraphrases are harder.

Document Q&A RAG chatbots Knowledge bases
PRODUCTION TRACES
97.4% F1
Precision 98.6% Recall 96.2%

1,500+ manually verified traces from HR, finance, and operations — predominantly tool-calling agents with structured API/database outputs.

All benchmarks reproducible. Same engine, same rules — accuracy depends on data structure, not configuration.
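F1 is the harmonic mean of each benchmark's precision/recall pair, so the headline numbers can be sanity-checked directly:

```typescript
// F1 = harmonic mean of precision and recall.
function f1(precision: number, recall: number): number {
  return (2 * precision * recall) / (precision + recall);
}

console.log(f1(96.4, 93.3).toFixed(1)); // HaluEval tool verification ≈ 94.8
console.log(f1(98.6, 96.2).toFixed(1)); // production traces ≈ 97.4
```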

Be among the first
to ship AI you can trust.

Stozer is in early access. The npm package is live. The hosted platform is coming soon.

Priority dashboard access Free tier — full quality Direct line to founders

Or start now: npm install stozer-ai