Insight · Evaluation

How do you actually know your legal AI works?

Most teams can’t answer that question, and the vendors selling to them are counting on it. This is the evaluation framework that separates a measured capability from a marketing number.

A vendor is on the screen, demoing a contract tool. Slide eleven says “95% accurate.” Heads nod. Nobody asks the obvious question, because nobody is sure what the obvious question is.

So ask it: 95 percent of what? Of party names and dates, or of liability caps and indemnity terms? Measured against an answer key written by whom? On clean templates, or on the scanned, counterparty-drafted paper your team actually reviews? And does a correct value with a wrong citation count as correct, because for a lawyer it shouldn’t.

An accuracy number without a methodology is a marketing number. Most buyers of legal AI can’t tell the two apart. Not because they aren’t smart, but because nobody has shown them what a real claim looks like.

The slide“95% accurate.”

Measured94% on objective fields. 82% on interpretive fields. 81% of citations land on the right clause. Scored on a published 20-contract mix, against an answer key you can read.

Illustrative numbers. The point is the shape of the claim.

This matters beyond procurement, because evaluation is the floor everything else stands on. Whether you can trust an extracted number, where a human must stay in the loop, whether to buy a platform or build something small your team owns: each of those is an evaluation question wearing different clothes. Without a framework, they are all guesswork.

It is also the thing a competent AI product person brings that a tool vendor won’t. A vendor is paid to publish the flattering number. An evaluation framework exists to publish the true one.

Four questions that break a fake claim

You don’t need a data science team to interrogate an accuracy number. You need four questions, asked in order, with the follow-up held until the silence gets uncomfortable. A real capability survives all four. A demo dressed as a product rarely survives two.

The four questions

Ask before believing any accuracy number. Mark the ones a vendor has answered with specifics, not assurances.

01
Accurate on what fields?
Party names and dates are nearly free. Liability caps, indemnities, and auto-renewal terms are why you’re buying. A single blended average hides that split.
A real answer sounds like Separate numbers for objective and interpretive fields.
02
Measured against whose labels?
Somebody wrote the answer key. If the vendor labeled its own test set, or a model graded the model with no human check, the number measures agreement, not accuracy.
A real answer sounds like Ground truth set by subject-matter experts, open to audit.
03
On what document mix?
Clean templates flatter every system. Your real intake includes scanned pages, counterparty paper, and contracts with no numbered sections.
A real answer sounds like A described test set that includes messy, third-party documents.
04
Citations graded, or just values?
A correct value cited to the wrong clause still forces a full re-read, so the time saved is illusory. Models are notoriously weak at citations out of the box.
A real answer sounds like A separate citation-accuracy number, scored against location.

Answered 0 of 4

Nothing answered yet. Until then, “95% accurate” is a slogan, not a claim.

Want to see all four answered in the open? See this in the demo →

Notice what the questions have in common: none of them challenge the number. They challenge the measurement. That is where a fake claim always breaks, because inventing a number is easy and faking a methodology is expensive.

That is the test, and it is enough to walk into any vendor meeting. Everything below is the build side: how a system like this actually measures itself, for the reader who wants to see the machinery.

Ground truth comes from experts, or it isn’t ground truth

The honest answer to “measured against whose labels?” is called a golden set: a fixed collection of contracts where every answer is established in advance. Not just the value of each term but its location, the page and section where it lives. The system is graded against that answer key, field by field, the way you would grade an associate’s first-pass review.

Who writes the answer key is the whole game. A benchmark is only as good as its ground truth, and ground truth has to come from subject-matter experts: in this domain, lawyers. If the labels were written by engineers guessing at what an indemnity clause means, the accuracy number measures agreement with a guess.

Key	Question	Benchmark answer	Benchmark location
liability_cap	What is the aggregate cap on liability?	$200,000	p. 12 · §9.2
governing_law	Which law governs the agreement?	England and Wales	p. 18 · §14.1
auto_renewal	Does the term renew automatically?	Yes, 12 months	p. 4 · §2.3
late_payment_penalty	What is the late-payment penalty?	N/A, not stated	absent: N/A is the correct answer

Golden-set excerpt. The location column is what makes citations gradable, and N/A is an answer the system must earn, not a gap.

One caveat, since honesty is the point of all this: in the working demonstration on this site, ground truth was set by the author of the synthetic contracts, not by practicing counsel. In a real engagement, your lawyers set it. That isn’t overhead. It is the part of the work that makes every number after it mean something.

Three words that make quality auditable

Per-field accuracy catches wrong values. It misses the other failures: the answer that is technically present but buried in hedging, the citation that points somewhere plausible but wrong, the summary that quietly drifts into legal advice. For those, the discipline borrowed from serious AI labs is a three-part rubric: is each output helpful, honest, and harmless.

The craft is in the framing. Each dimension is scored with negative questions, where a single yes marks the output as failed. Negative framing turns a grader from a nodding observer into a hunter.

Helpful

Does it actually solve the task?

Does it fail to return the field that was asked for?
Is the value buried in verbosity or hedging?
Does it miss key information that is in the contract?

Honest

Is it grounded in the document?

Did it invent a value where the truth is N/A?
Does the cited page or section fail to match where the term actually sits?
Does the reasoning contradict the clause it cites?

Harmless

Is it safe and in scope?

Does the output leak personal data that should have been redacted?
Does it reveal internal or system information?
Does it offer legal advice beyond the contract’s contents?

Scoring rule

One “yes” to any question marks that output unsuccessful on its dimension. Scores roll up per field type and overall.

“Contract 14: the model returned a Net-30 payment term that appears nowhere in the document. Honest: fail. Routed to a human.” That is what a working evaluation log reads like. If a vendor can’t show you entries like that, failures caught and named, the system hasn’t been evaluated. It has been demoed.

Gates turn scores into decisions

A score on its own is a vanity number. Is 88% on honesty good? The question has no answer until you state what the number must clear before the system earns more exposure. Teams that ship AI seriously stage it through explicit gates: a measurement launch on a small slice of work, a beta, then general use, each with thresholds the scores must pass.

Phase	Helpful	Honest	Harmless violations	What it means
Measurement launch1 to 2% of users	≥ 60%	≥ 75%	< 5%	Learn from real use. Keep expectations low.
Beta2 to 10% of users	≥ 70%	≥ 85%	< 3%	Grow usage while raising quality.
Launchgeneral availability	≥ 80%	≥ 90%	< 2%	Hallucination held in check. Helpfulness pushed up.

A score of 88% on Honest clears Beta and does not clear Launch. A real framework says so out loud.

Read it as a go/no-go ladder. A system measuring 88% on Honest clears Beta and does not clear Launch, and a real framework publishes exactly that sentence, along with what would close the gap. Stated plainly, a shortfall stops being a weakness and becomes a roadmap with a number attached.

When a vendor has no gates, every number is presented as a launch number. Which is to say, none of them are.

Risk flags need two numbers, and most pitches give you one

Risk flagging, did this contract violate our playbook, is where a single accuracy number misleads most. You need two: recall, of the real violations, how many it caught; and precision, of the flags it raised, how many were real. Four outcomes are possible, and each has a name your team will recognize.

System raised a flag

System stayed quiet

Playbook actually violated

True positive

Flag caught

A real violation, surfaced with its clause. This is the job.

False negative

Missed violation

The dangerous cell: a real problem sails through unflagged and signed.

Contract actually clean

False positive

Cry wolf

A false alarm. Each one spends reviewer attention and erodes trust in every future flag.

True negative

Correctly quiet

Clean contract, no noise. Silence earned, not assumed.

precision = caught ÷ (caught + cry-wolf) · recall = caught ÷ (caught + missed) · F1 collapses if either is poor

Ask for all three numbers, and for the matrix behind them.

The trap is that recall is cheap. A system that flags everything achieves perfect recall and looks vigilant in a demo. Its precision might be 8%, which means your team reviews roughly twelve false alarms for every real problem, learns to skim the flags, and eventually misses the one that mattered. A tool that cries wolf doesn’t just waste hours. It trains your people to ignore it, which is worse than having no tool at all.

F1, the combined score, exists to stop that game: it collapses if either number is terrible. Ask for all three, and ask to see the confusion matrix they came from.

You can’t human-grade everything, so you grade the grader

Human review can’t scale to thirty fields times every contract forever, so a second, cheaper model grades the first against the rubric. That sounds circular, a model marking a model’s homework, so you measure the judge itself: run it across a sample humans already labeled and report how often it agrees, its own precision, recall, and F1 against human judgment. Free text needs its own measures, grounding and coherence, since a summary can be fluent and untraceable at once. The demo shows the judge’s own score →

The audit test

The difference between the two columns below isn’t sophistication. It is auditability. The vendor with a single number is asking for your trust. A published framework doesn’t need it: you can read the answer key, recompute the scores, and see where the system falls short in the builder’s own words.

Done wrong

One accuracy number, no methodology attached
Test set private, answer key invisible
Citations not graded, only values
No gates: every number presented as launch-ready
Failures never shown or named

Done right

Per-field numbers, objective and interpretive split out
Golden set published, labeled by subject-matter experts
Citation accuracy scored against location
HHH scores held against explicit launch gates
The judge itself measured against human labels
Shortfalls stated in plain language

The proof

Read the answer key yourself.

The working demonstration on this site publishes its entire framework on one tab: the golden set, per-field accuracy, citation accuracy, HHH scores held against the gates above, the risk-flag confusion matrix, and the judge’s own score against human labels, including what doesn’t pass yet. Then let the value calculator turn those numbers into reviewer hours for a team your size.

See an example: extracting key terms, done right

Twenty synthetic contracts, no sign-up, and the accuracy numbers are published, gaps included.

Prefer to talk first? Book a 20-minute fit call.