Insight · Human-in-the-loop design

The 20% problem: where a human must stay in the loop (and how to find it automatically)

Neither ‘trust it all’ nor ‘check it all’ survives contact with a real contract workload. The usable answer is a system honest enough to know which fifth of its own output deserves a lawyer’s eyes.

Every General Counsel who has sat through an AI demo has already run the private disaster movie. The system misreads a liability cap in March. The record looks clean, so nobody opens the PDF again. The error sits in the CRM through two renewals and surfaces in November, stapled to a demand letter, and no one can say when a human last looked at the clause.

Notice what the fear is not. It is not that the AI is slow, or expensive, or awkward to roll out. The fear is autonomy: software acting alone on a document where a mistake is not a typo, it is exposure.

The reflex cure is worse than it looks. ‘A lawyer will check everything’ sounds responsible, and it quietly deletes the value. If a qualified person re-reads every field of every contract the system touches, you have not automated review, you have added a step to it. Reviewing everything defeats the purpose. Reviewing nothing is indefensible. The entire question is what sits between.

This question sits inside a cluster of properties that decide whether contract AI is usable at all. A value you cannot trace back to its clause cannot be signed off, however often it is right: that is the trust problem. A model that invents a plausible answer rather than admitting ignorance poisons everything downstream: that is the honesty problem. And neither claim means anything until someone measures it on contracts labeled by lawyers: that is the evaluation problem.

This piece is about the property those three exist to serve: deciding, field by field, when the machine may proceed on its own and when a person must look. Most debates frame it as a binary. It is a band.

Autonomy runs along a spectrum, from full-manual review of every field to full-auto with no human in sight, and the only useful zone sits in between.

A confidence band is an operating instruction, not a mood

In the pipeline this article draws on, every extracted field, all 30 of them per contract, carries a confidence band: high, medium, or low. It is tempting to read these as the model’s feelings about its own answers. They must not be read that way. A band is only worth something once it has been checked against a labeled test set: when the system says high, it is right some measured share of the time, and when it says low, that share drops far enough that nobody should act on the value unread.
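As a rough illustration of what "checked against a labeled test set" can mean in practice, the sketch below tallies accuracy per band. The tuple layout, the field names, and the exact-match comparison are all assumptions made for the example; a real pipeline would likely normalize values before comparing them.

```python
from collections import defaultdict

def accuracy_by_band(predictions, gold):
    """Return {band: (correct, total, accuracy)} measured against lawyer labels.

    predictions: iterable of (contract_id, field, value, band)   # hypothetical layout
    gold: dict mapping (contract_id, field) -> lawyer-labeled value
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [correct, total]
    for contract_id, field, value, band in predictions:
        key = (contract_id, field)
        if key not in gold:
            continue  # unlabeled fields cannot be scored
        counts[band][1] += 1
        if value == gold[key]:  # assumption: exact match counts as correct
            counts[band][0] += 1
    return {band: (c, t, c / t) for band, (c, t) in counts.items()}

# Toy data, invented for illustration only.
predictions = [
    ("c1", "governing_law", "New York", "high"),
    ("c1", "liability_cap", "$2,000,000", "low"),
    ("c2", "governing_law", "Delaware", "high"),
    ("c2", "liability_cap", "12 months fees", "low"),
    ("c2", "term_months", "36", "medium"),
]
gold = {
    ("c1", "governing_law"): "New York",
    ("c1", "liability_cap"): "12 months fees",
    ("c2", "governing_law"): "Delaware",
    ("c2", "liability_cap"): "12 months fees",
    ("c2", "term_months"): "36",
}
report = accuracy_by_band(predictions, gold)
for band in ("high", "medium", "low"):
    correct, total, acc = report[band]
    print(f"{band}: {correct}/{total} correct ({acc:.0%})")
```

A band label only becomes an operating instruction once a table like this exists and the high row is a number the team has seen and accepted.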

Measurement is what turns confidence from decoration into routing. Each band maps to one operational behavior, agreed before the system touches a real contract:

Band	What happens to the value	What a human does
high ▮▮▮	Enters the record.	Signs off on the batch, not the field.
medium ▮▮▯	Enters the record, sampled in the ongoing eval.	Spot-checks the sample. First band to tighten if measured accuracy slips.
low ▮▯▯	Never enters the record on its own.	Reviews it in the queue, with the citation and the top checks attached.

One band, one behavior: agreed in advance, not improvised per contract
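The table above can be read as a small dispatch rule. The following is a minimal sketch under assumed names (`Band`, `route`, and the three lists are hypothetical, not the pipeline's actual interfaces); it shows only that each band triggers exactly one behavior and that a low-band value never reaches the record on its own.

```python
from enum import Enum

class Band(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

def route(item, band, record, eval_pool, review_queue):
    """Apply the single behavior agreed in advance for this band."""
    if band is Band.HIGH:
        record.append(item)        # enters the record; signed off per batch
    elif band is Band.MEDIUM:
        record.append(item)        # enters the record...
        eval_pool.append(item)     # ...and is eligible for spot-check sampling
    elif band is Band.LOW:
        review_queue.append(item)  # never enters the record on its own
    else:
        raise ValueError(f"unknown band: {band!r}")

# Toy values, invented for illustration.
record, eval_pool, review_queue = [], [], []
route(("governing_law", "New York"), Band.HIGH, record, eval_pool, review_queue)
route(("term_months", "36"), Band.MEDIUM, record, eval_pool, review_queue)
route(("liability_cap", "$2,000,000"), Band.LOW, record, eval_pool, review_queue)
print(len(record), len(eval_pool), len(review_queue))
```

Keeping the rule this small is the point: there is no per-contract judgment call in the routing step, so any change to it is a deliberate policy change.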

The wrong version is easy to spot once you know what to look for. A system with no unsure state has only one move: present every answer with the same straight face. The most dangerous tools on the market are not the least accurate ones. They are the ones with no mechanism for telling you which answers to doubt.

The review queue is a workplace, not an apology

Routing only works if what lands in front of the lawyer is reviewable in minutes. Send someone a 40-page MSA with a note that says ‘something in here is uncertain’ and you have rebuilt the original job with extra steps.

So the queue item is a designed surface. It names the field, shows the value the model extracted and the exact clause it came from, says why it was routed, and lists the top three things to check, in order. The reviewer opens one clause, not one contract.

Low-confidence field

low ▮▯▯

liability_cap: two values found that disagree

§ 9.2 reads “fees paid in the preceding 12 months” · Exhibit C reads “$2,000,000”

1. Decide which value controls: § 9.2 caps liability at fees paid in the preceding 12 months, while Exhibit C states a flat $2,000,000.
2. Check the order-of-precedence clause (§ 17.4): it may already resolve the conflict in favor of the body of the agreement.
3. Confirm whether the cap was meant to carve out the indemnification obligations in § 10. This draft does not.

A queue item as the reviewer sees it · mirrors step 6 of the working demonstration
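One way to make "designed surface" concrete is to treat the queue item as a typed record that refuses to exist without a citation or with an unbounded checklist. The sketch below is an assumption about shape, not the demonstration's actual schema; `Citation` and `QueueItem` are hypothetical names, and the example values are taken from the card above.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    location: str  # where the text sits, e.g. "§ 9.2"
    quote: str     # the exact words found there

@dataclass(frozen=True)
class QueueItem:
    field_name: str
    band: str
    reason: str                       # why the field was routed
    citations: tuple[Citation, ...]   # the clause(s) the value came from
    checks: tuple[str, ...]           # what to verify, most important first

    def __post_init__(self):
        if not self.citations:
            raise ValueError("a queue item needs at least one cited clause")
        if not 1 <= len(self.checks) <= 3:
            raise ValueError("list one to three checks, most important first")

item = QueueItem(
    field_name="liability_cap",
    band="low",
    reason="two values found that disagree",
    citations=(
        Citation("§ 9.2", "fees paid in the preceding 12 months"),
        Citation("Exhibit C", "$2,000,000"),
    ),
    checks=(
        "Decide which value controls.",
        "Check the order-of-precedence clause (§ 17.4).",
        "Confirm whether the cap carves out the indemnification obligations in § 10.",
    ),
)
print(item.field_name, "|", item.reason)
```

The validation in `__post_init__` encodes the design rule directly: an item with no clause to open, or with an unbounded list of things to check, is not reviewable in minutes and should not reach the queue.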

The arithmetic is the whole argument. A routed item that arrives with its citation and its three checks takes minutes. Re-reading the contract takes most of an hour. Keep the cost per routed item small, and routing 20 contracts in 100 still leaves nearly all of the saving intact.
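To make that arithmetic checkable, here is a small sketch with illustrative numbers. The 5 and 50 minute figures are assumptions standing in for "minutes" and "most of an hour"; the sketch also assumes one routed item per routed contract and ignores the time spent on batch sign-off.

```python
CONTRACTS = 100
ROUTED = 20
MINUTES_FULL_REREAD = 50      # assumption: "most of an hour" per contract
MINUTES_PER_ROUTED_ITEM = 5   # assumption: "minutes" per routed item

def saving_retained(contracts, routed, full_minutes, routed_minutes):
    """Return (baseline minutes, minutes with routing, share of baseline saved)."""
    baseline = contracts * full_minutes     # a lawyer re-reads every contract
    with_routing = routed * routed_minutes  # a lawyer opens only routed items
    return baseline, with_routing, (baseline - with_routing) / baseline

baseline, with_routing, share = saving_retained(
    CONTRACTS, ROUTED, MINUTES_FULL_REREAD, MINUTES_PER_ROUTED_ITEM
)
print(f"check everything: {baseline} min, routed only: {with_routing} min, "
      f"saved: {share:.0%}")

# Same routing rate, but each routed item costs a full re-read.
_, slow_routing, slow_share = saving_retained(CONTRACTS, ROUTED, 50, 50)
print(f"unprepared queue: {slow_routing} min, saved: {slow_share:.0%}")
```

Under these assumed figures the saving is driven far more by the cost per routed item than by the routing rate, which is why the queue item has to arrive with its citation and checks attached.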

Calibration: how a system earns the easy 80%

So where does the gate go? Not where it feels right. You run the system against a golden set, contracts already labeled by lawyers, and watch how accuracy moves across the confidence bands. The gate goes where the measured error rate of the auto-approved share is one your team is willing to sign its name to. Then you keep measuring, because contract mixes drift, and a gate calibrated in January can be wrong by June.
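A minimal sketch of that placement rule follows, assuming each golden-set field comes with a numeric confidence score and a correct/incorrect label from lawyer review. The function name, the data layout, and the 5% tolerance are all hypothetical; the technique is a plain threshold sweep, not necessarily what any particular product does.

```python
def calibrate_gate(scored, max_error_rate):
    """Pick the lowest confidence gate whose auto-approved share stays within tolerance.

    scored: list of (confidence, is_correct) pairs from a lawyer-labeled golden set
    Returns (gate, auto_approved_share, measured_error_rate), or None if no gate qualifies.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    errors = 0
    for seen, (confidence, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        # Only evaluate at a boundary between distinct confidence values,
        # so tied scores always fall on the same side of the gate.
        if seen < len(ranked) and ranked[seen][0] == confidence:
            continue
        error_rate = errors / seen
        if error_rate <= max_error_rate:
            best = (confidence, seen / len(ranked), error_rate)
    return best

# Toy golden set, invented for illustration.
golden = [
    (0.99, True), (0.97, True), (0.95, True), (0.92, True), (0.90, True),
    (0.85, True), (0.82, True), (0.80, True), (0.60, False), (0.40, False),
]
print(calibrate_gate(golden, max_error_rate=0.05))
```

Re-running the same function on a freshly labeled sample at intervals is one simple way to notice when a gate set in January no longer holds in June.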

Done honestly, this produces the only trade a buyer should accept: the system earns autonomy on the easy 80% by being transparent about the hard 20%. It does not ask to be trusted everywhere. It asks to be trusted where it has shown its work, and it hands a person the rest.

Here is the same idea across one hundred contracts. Drag the gate and watch both failure modes appear at the ends of the range.

One hundred contracts, one gate

Interactive figure: a confidence gate slides across one hundred contracts. Legend: auto-approved (batch sign-off) · routed to a human. Setting shown: route below 80% (calibrated).

The working zone. Human attention goes only where the system has measured doubt; the rest gets a batch sign-off, and the saving survives.

Synthetic distribution, shaped like the demo corpus: most fields extract cleanly, a stubborn tail does not

Want to see the queue catch the doubtful fifth? See this in the demo →

A disclaimer is not a loop

Most legal AI tools handle everything above with one sentence in the footer: ‘AI-generated responses may contain errors. Verify important information.’ Read that sentence as a system design and it says: any output might be wrong, we will not tell you which, and checking is now your job. Nothing was automated. The review burden was relabeled and handed back, along with the liability, to the person with the least information about where to look.

Lawyer-in-the-loop design makes the opposite commitment. Human attention is part of the architecture: budgeted, routed, and spent exactly where measurement says it is needed. The disclaimer says verify everything. The loop says verify these three things, on this contract, in this clause, and here is why.

Disclaimer-only tool

Unsure state: None. Every value wears the same straight face.
Confidence: Not shown, or shown and never measured.
Human role: Check everything, unguided, on your own time.
Errors surface: Downstream, in a dispute.

Lawyer-in-the-loop design

Unsure state: First-class. Low confidence exists and routes automatically.
Confidence: A measured band on every field, checked against a golden set.
Human role: Review the routed ~20%, top three checks listed.
Errors surface: In the queue, before the record exists.

This is also why enterprise buyers behave the way they do. They will forgive slow automation: a system that takes a minute per contract still beats a three-week backlog. They will not forgive blind automation, because a wrong value in a system of record does not look wrong. It looks exactly like every right value, until the day it is load-bearing. Slow costs hours. Blind costs a position you cannot defend in front of a board, a regulator, or a court. The visible seam (the bands, the queue, the routed 20%) is not an admission of weakness. It is what makes the other 80% usable.

The proof

Watch the doubtful fifth land in the queue.

None of this is hypothetical. Step 6 of the demonstration is exactly this queue: pick any of the 20 synthetic contracts, run the pipeline, and watch low-confidence fields and high-risk flags drop in, each with its top three checks already written.

See an example: extracting key terms, done right

Twenty synthetic contracts, no sign-up, and the accuracy numbers are published, gaps included.
