Grounding validation for AI agents. Zero LLM calls.
Tool agents: 98.6% precision. RAG pipelines: transparent coverage with semantic fallback.
The problem
Whether your agent calls an API or retrieves documents — the source of truth is right there in the trace.
Stozer checks if the answer actually matches it.
Tool output:

```json
{
  "name": "Emily Carter",
  "balance": 2450,
  "status": "active"
}
```

Agent response: "Emily's balance is $2,540 and her account is active."

Retrieved context: "...the refund policy allows returns within 30 days of purchase. Items must be in original packaging. Gift cards are non-refundable..."

Agent response: "You can return items within 60 days. All purchases are eligible for refund."
How it works
Drop in a few lines of code, get a verdict in milliseconds.

1. Send the agent trace — tool calls, retrieved context, and the final response.
2. 50+ deterministic rules check every claim against the ground truth. No LLM. Zero API calls.
3. Get a pass/fail verdict with exact failure reasons. Block, warn, or log — you choose.
```ts
const trace = new TraceBuilder()
  .addToolCall('getUser', { id: 'U-42' })
  .addToolOutput('getUser', {
    name: 'Emily Carter',
    balance: 2450,
    status: 'active'
  })
  .addFinalResponse("Emily's balance is $2,450.")
  .build();

const { report } = await stozer.evaluate(trace);
// score: 1.0 — all claims grounded ✓
```
```ts
const trace = new TraceBuilder()
  .addContextChunks([{
    text: 'Refund policy: returns within '
        + '30 days. Gift cards non-refundable.'
  }])
  .addFinalResponse('Returns accepted within 60 days.')
  .build();

const { report } = await stozer.evaluate(trace);
// score: 0.0 — numeric mismatch 30→60
// coverage: 85% deterministic
```
Structured JSON — APIs, databases, functions
Free-text docs — catches fabrications, reports coverage
Detection
Every claim in the response is extracted, matched to source data, and verified — deterministic rules on structured data, with optional semantic fallback for free-text.
- **Numeric drift** — Prices, dates, quantities, percentages — any number that drifts from the source data.
- **Entity confusion** — Wrong name, wrong company, wrong product. Cross-contamination between records.
- **Unsupported claims** — Statements with no basis in retrieved context. Fabricated policies, invented features.
- **Status errors** — Order marked "shipped" when it's "processing". Account shown "active" when suspended.
- **Omissions** — Omitted disclaimers, dropped conditions, ignored caveats from the source material.
- **Stale data** — Outdated information presented as current. Wrong dates, expired offers, stale data.
Structured values get strong deterministic detection; free-text claims fall back to a semantic NLI model. Strength depends on data structure, not configuration.
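To make the deterministic idea concrete, here is a toy numeric-grounding rule — an illustration only, not Stozer's actual rule set (which spans 50+ rules): extract every number claimed in the response and require that each one appears among the values in the source data.

```typescript
// Toy numeric-grounding rule, illustrative only: a claim's numbers
// must literally appear in the structured source data.
function extractNumbers(text: string): number[] {
  // Match digits with optional thousands separators and decimals.
  return (text.match(/\d[\d,]*(?:\.\d+)?/g) ?? []).map((s) =>
    Number(s.replace(/,/g, ""))
  );
}

function numericClaimsGrounded(
  response: string,
  source: Record<string, unknown>
): boolean {
  const sourceNumbers = new Set(
    Object.values(source).filter((v): v is number => typeof v === "number")
  );
  return extractNumbers(response).every((n) => sourceNumbers.has(n));
}

const source = { name: "Emily Carter", balance: 2450 };
console.log(numericClaimsGrounded("Emily's balance is $2,450.", source)); // true
console.log(numericClaimsGrounded("Emily's balance is $2,540.", source)); // false
```

Because the rule is a pure function of the trace, the same input always yields the same verdict — the property the comparison table below relies on.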
Why Stozer
LLM-as-a-judge is slow, expensive, and can hallucinate itself.
Stozer uses deterministic rules.
| | LLM-as-a-Judge | Stozer |
|---|---|---|
| Speed | 200–2,000 ms | <50 ms |
| Cost per eval | $0.01–$0.05 | $0 |
| Determinism | Non-deterministic | Same input → same output |
| Can hallucinate | Yes | No — rules don't hallucinate |
| Setup | Prompt engineering | Works out of the box |
| Explainability | Black-box score | Exact failure reasons |
| Accuracy on tool agents | F1 85–95% | F1 97.8% (production) |
For RAG pipelines (free-text documents), LLM judges may score higher on paraphrased content — Stozer excels at structured data and transparently reports its deterministic coverage on free text.
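The coverage idea can be sketched in a few lines — the claim shape and field names below are hypothetical, not Stozer's report format: coverage is simply the share of extracted claims that deterministic rules verified, with the remainder handed to the semantic fallback.

```typescript
// Hypothetical claim record: each claim is verified either by a
// deterministic rule or by the semantic (NLI) fallback.
type Claim = { text: string; verifiedBy: "deterministic" | "semantic" };

// Deterministic coverage = fraction of claims the rules could verify.
function deterministicCoverage(claims: Claim[]): number {
  if (claims.length === 0) return 1;
  const det = claims.filter((c) => c.verifiedBy === "deterministic").length;
  return det / claims.length;
}

const claims: Claim[] = [
  { text: "returns within 30 days", verifiedBy: "deterministic" },
  { text: "items must be in original packaging", verifiedBy: "semantic" },
];
console.log(deterministicCoverage(claims)); // 0.5
```

A high coverage number tells you most of the verdict rests on deterministic rules; a low one tells you how much weight the NLI fallback carried.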
Adoption
Five modes let you adopt incrementally:

- Explore historical traces
- Fail builds on regressions
- Silent production monitor
- Alert teams on failures
- Stop bad responses
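One way to wire a verdict into these modes — the mode names and dispatcher below are illustrative, not Stozer's API — is a small router that decides what a failing score triggers:

```typescript
// Hypothetical dispatcher mirroring the five adoption modes above.
// Mode names and the score threshold are assumptions, not stozer-ai API.
type Mode = "explore" | "ci" | "monitor" | "alert" | "gate";
type Verdict = { score: number; reasons: string[] };

function handle(mode: Mode, verdict: Verdict): string {
  const failed = verdict.score < 1.0;
  switch (mode) {
    case "explore": return "logged for analysis";
    case "ci":      return failed ? "build failed" : "build passed";
    case "monitor": return failed ? "recorded silently" : "ok";
    case "alert":   return failed ? "team notified" : "ok";
    case "gate":    return failed ? "response blocked" : "response delivered";
  }
}

console.log(handle("gate", { score: 0.0, reasons: ["numeric mismatch 30→60"] }));
// → "response blocked"
```

Starting in `explore` or `monitor` and graduating to `gate` is what makes the adoption incremental: the verdict is the same, only the action changes.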
Where it works best
The more structured the source data, the higher the precision. Same engine, same rules — the data determines the result.
AI agents that call APIs, query databases, or invoke functions and return structured JSON. Every value is typed and verifiable. Deterministic checks are near-perfect here.
AI agents that retrieve and summarize documents, PDFs, or knowledge base articles. Catches fabricated facts and numbers. Paraphrases are harder — Stozer reports a coverage metric showing what it verified deterministically vs. semantically.
Many real-world agents use both — tool calls for data, RAG for context. Stozer handles the full trace.
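A mixed trace simply carries both kinds of evidence. The sketch below uses the builder pattern from the examples above, with a minimal stand-in implementation so it runs standalone — the real `TraceBuilder` ships in `stozer-ai`, and the internal trace shape shown here is an assumption, not the actual format:

```typescript
// Minimal stand-in for the TraceBuilder pattern, for illustration only.
// The trace shape below is assumed; the real builder is in stozer-ai.
type Trace = {
  toolCalls: { name: string; args: unknown; output?: unknown }[];
  contextChunks: { text: string }[];
  finalResponse?: string;
};

class TraceBuilder {
  private trace: Trace = { toolCalls: [], contextChunks: [] };
  addToolCall(name: string, args: unknown) {
    this.trace.toolCalls.push({ name, args });
    return this;
  }
  addToolOutput(name: string, output: unknown) {
    const call = this.trace.toolCalls.find((c) => c.name === name);
    if (call) call.output = output;
    return this;
  }
  addContextChunks(chunks: { text: string }[]) {
    this.trace.contextChunks.push(...chunks);
    return this;
  }
  addFinalResponse(text: string) {
    this.trace.finalResponse = text;
    return this;
  }
  build(): Trace {
    return this.trace;
  }
}

// One trace, two evidence sources: a structured tool result plus
// a retrieved free-text context chunk.
const trace = new TraceBuilder()
  .addToolCall("getUser", { id: "U-42" })
  .addToolOutput("getUser", { name: "Emily Carter", balance: 2450 })
  .addContextChunks([{ text: "Refund policy: returns within 30 days." }])
  .addFinalResponse("Emily has $2,450 and can return items within 30 days.")
  .build();

console.log(trace.toolCalls.length, trace.contextChunks.length); // 1 1
```

Claims backed by the tool output get deterministic checks; claims backed by the context chunk fall under the coverage-reported free-text path.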
11 languages — EN, SR, ES, FR, PT, DE, IT, RU, HI, AR, BN
Benchmarks
Reproducible results on public and production datasets, split by data type.
Read the full benchmark report.

- HaluEval QA: 16,662 samples. Near-zero false positives on structured API/database outputs.
- FaithBench: 750 expert-annotated summaries. Catches fabricated data in documents; paraphrases are harder.
- Production set: 1,500+ manually verified traces from HR, finance, and operations — predominantly tool-calling agents with structured API/database outputs.
All benchmarks reproducible. Same engine, same rules — accuracy depends on data structure, not configuration.
Stozer is in early access. The npm package is live; the hosted platform is coming soon.

Start now: `npm install stozer-ai`