
Insight · Build, buy, or DIY

Buy a $150K platform, rent a chatbot, or build it yourself?

The market offers a legal team two bad options. The right call depends on something most teams have never mapped: how complex each use case actually is.

You sat through the platform demo. It was genuinely impressive, and it costs $150K a year before implementation. Somewhere around minute forty it became clear that the real price was not the money. It was your team migrating templates into the vendor's clause library, reclassifying a decade of agreements, and learning to work the way the platform thinks.

Then someone in the business pointed out that a $30-a-seat chatbot “already does this.” So you tried it on a real MSA. It answered every question fluently and confidently, including the one about an indemnity carve-out that is not in the document.

Neither option fits how your team actually works. And both camps would like you to believe those are the only two choices.

The decision, mapped

Build versus buy is not one decision. It is one decision per use case.

The binary hides the real variable. Whether to buy, build, or wait turns on how complex each individual use case is, and that complexity map is exactly the thing most teams have not drawn when the vendor shows up. So they buy a platform sized for their hardest problem and use a tenth of it.

This is also where the rest of a legal team's AI worries converge. An option you cannot audit fails on traceability. One with no published accuracy fails on evaluation. One that ships your contracts somewhere opaque fails on data governance. Build versus buy is where all of those bills come due at once.

A use case on your list
  Commodity plumbing? (e-signature, repository, OCR)
    YES: Buy it. Solved, undifferentiated.
    NO: How complex is the judgment? (not the page count)
      LOW ▮▯▯ · BUILD: Prompts are enough. Off-the-shelf model, careful prompts, an eval set.
      MEDIUM ▮▮▯ · BUILD, WITH HELP: Your examples, your playbook. Encode your standards, evaluate before anyone trusts it.
      HIGH ▮▮▮ · WAIT: Keep the human. SME-labeled data, custom work; revisit once low and medium run.
One decision per use case, not one decision for the whole legal department. Most lists end up with a handful of build branches, one or two buys, and a deliberate wait.
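The same map can be written down as a few lines of code, which makes it easy to run over a whole list of use cases. This is a minimal sketch, not a tool from the demo: the `UseCase` fields, the `decide` function, and the sample use cases are illustrative assumptions that mirror the branches above.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class UseCase:
    name: str
    commodity_plumbing: bool  # e-signature, repository, OCR
    complexity: Complexity    # how much judgment is needed, not page count


def decide(use_case: UseCase) -> str:
    """Return the branch of the decision map for one use case."""
    if use_case.commodity_plumbing:
        return "buy"
    if use_case.complexity is Complexity.LOW:
        return "build"
    if use_case.complexity is Complexity.MEDIUM:
        return "build, with help"
    return "wait"


# Hypothetical list; your own mapping is the real input.
use_cases = [
    UseCase("e-signature", True, Complexity.LOW),
    UseCase("field extraction", False, Complexity.LOW),
    UseCase("playbook risk flags", False, Complexity.MEDIUM),
    UseCase("negotiation strategy", False, Complexity.HIGH),
]
decisions = {uc.name: decide(uc) for uc in use_cases}
```

The function is trivial on purpose. The hard part is filling in `commodity_plumbing` and `complexity` honestly for each row.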

Two things worth noticing. The buy branch is real: nobody should hand-build e-signature or document storage. And the wait branch is not defeat. Deep interpretive work stays with your lawyers, deliberately, while the lower tiers earn trust.

Use-case complexity

What makes a use case hard is judgment, not page count.

A 60-page MSA can be an easy AI problem and a 2-page letter can be a hard one. The question is how much interpretation the system has to do, and each tier of judgment needs a genuinely different approach.

LOW ▮▯▯

Routine reading

Tasks that live here

Plain-English summaries. Standard field extraction: parties, dates, renewal terms, payment terms, liability caps.

What it genuinely needs

A strong off-the-shelf model, carefully written prompts, and an evaluation set. No training data. No fine-tuning.

Share of a typical in-house workload: the bulk of weekly volume

MEDIUM ▮▮▯

Your standards, applied

Tasks that live here

Top-5 concern identification. Clause extraction specific to your templates. Risk flags scored against your own playbook.

What it genuinely needs

Examples from your contracts, your playbook encoded as editable rules, and evaluation before anyone trusts it. Light fine-tuning later, once enough examples accrue.

Share of a typical in-house workload: a meaningful slice

HIGH ▮▮▮

Deep interpretation

Tasks that live here

Liability and arbitration interplay. Negotiation strategy. Judgment calls where a wrong answer is a lawsuit.

What it genuinely needs

SME-labeled data and genuinely custom work. Often the honest answer today is: this stays with a lawyer.

Share of a typical in-house workload: a thin slice

Shares are indicative of the in-house workloads I see, not a measurement. Your actual mix is the thing worth mapping before any buying decision.

Most of what a legal team needs day to day sits in the low and medium tiers. Summaries, field extraction, renewal tracking, first-pass risk flags against your own playbook. That is exactly the work a small in-house build handles well, which is why the $150K platform is so often sized for a problem you mostly do not have.

The high tier is real, and it is where honesty matters most. Liability and arbitration interplay, negotiation judgment, anything where a wrong answer is a lawsuit: that work needs expert-labeled data and custom effort, and much of it should simply stay with a lawyer for now.

The long tail

Platforms demo on the MSA. Your risk lives in the letter agreement from 2019.

Contract platforms are, at heart, template businesses. They shine on the documents their library anticipates: the standard MSA, the standard NDA, the DPA everyone signs. Real contract estates are messier. The ad-hoc letter agreement nobody templated, the amendment to an amendment, the acquired company's entire stack on different paper.

Tail coverage is a classification problem before it is an extraction problem. A system has to recognize what kind of document it is holding, and say so plainly when it does not know, rather than forcing every page through the nearest template. That single behavior separates the three options more sharply than any pricing page.
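A minimal sketch of that behavior, assuming a classifier that returns a score per document type. The type names, the 0.8 threshold, and the routing labels are hypothetical placeholders, not values from any real system.

```python
# Doc types the extraction prompts were actually built and evaluated for.
KNOWN_DOC_TYPES = ("msa", "nda", "dpa")


def pick_doc_type(scores: dict[str, float], min_confidence: float = 0.8) -> str:
    """Return the best-scoring known type, or "unknown" when unsure."""
    if not scores:
        return "unknown"
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    if best_type not in KNOWN_DOC_TYPES or best_score < min_confidence:
        return "unknown"
    return best_type


def route(scores: dict[str, float]) -> str:
    """Send unknown documents to a person instead of the nearest template."""
    doc_type = pick_doc_type(scores)
    if doc_type == "unknown":
        return "human_review"
    return f"extract_with_{doc_type}_prompt"


clear_case = route({"msa": 0.95, "nda": 0.03})
ambiguous_case = route({"msa": 0.45, "nda": 0.40})
```

Here the ambiguous letter agreement goes to `human_review` rather than through the MSA template. The threshold is something to tune against your own evaluation set, not a constant to copy.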

| | $150K platform | Generic chatbot | Small fitted build |
| --- | --- | --- | --- |
| Fit to your contracts | Your templates and process bend to its clause library. | Generic. Knows nothing of your paper or your playbook. | Shaped on your doc types, your fields, your playbook. |
| Auditability | Varies. Risk scores are often a black box you cannot inspect. | None. Fluent answers, no citations, no published accuracy. | Every value cited to its clause; accuracy measured and published. |
| Who owns it | The vendor. You rent, and leaving is the expensive part. | The vendor. Your corrections improve their product, not yours. | Your team. Code, prompts, eval set, and data stay in-house. |
| Cost shape | Six figures a year, plus implementation, indefinitely. | Cheap per seat. Expensive per error nobody caught. | A fixed build cost, then cents per contract to run. |
| The long tail | Strong on templated MSAs and NDAs, weak past its library. | Reads anything, with no signal when it is out of its depth. | Classifies the doc type first and says unknown honestly. |
An honest note on the last column: a fitted build is only better when the use cases are mapped first. Built against the wrong use case, it is just a cheaper way to be wrong.

Want to see the small-build option running? See this in the demo →

Infrastructure honesty

When a vector store earns its keep, and when it is resume-driven architecture.

Every contract-AI proposal now arrives wearing a vector database, and sometimes a knowledge graph. Single-contract extraction, the work that saves your team hours this quarter, needs neither.

A useful test for any proposal: ask which specific question the vector store answers that a simpler approach cannot. If the reply names one of your use cases, good. If the reply is “scale” or “future-proofing,” you are looking at resume-driven architecture, and you are paying for it.

Contract to cash

Extraction that lands in the CRM compounds. Extraction that dies in a chat window gets re-typed.

This is the quietest failure of the chatbot option: even when the answer is right, it has nowhere to go. Contract data is operational data. The renewal date belongs in the CRM with a task attached. The notice window belongs on a deadline. The payment terms belong next to billing, where someone can see the invoice does not match.

A small fitted build closes that loop by design, because you define where each field lands before you build the extraction. A chat window cannot, because a paragraph is not a record.

Where extraction usually dies

“The renewal date is March 31, 2027, with a 60-day non-renewal notice window, payment terms are Net 45, and the liability cap is $500,000.”

→ copied into an email · re-typed into the CRM · or nowhere

The contract-to-cash loop: the same answer as a record

| Field | Value | Where it lands |
| --- | --- | --- |
| renewal_date | 2027-03-31 | CRM: renewal pipeline, dated |
| auto_renew_notice | 60 days | Task: notify by 2027-01-30 |
| payment_terms | Net 45 | Billing: terms reconciled |
| liability_cap | $500,000 | Deal desk: flagged for review |
The second version is what step 7 of the working demo does: a mocked write to a HubSpot/Salesforce-shaped schema, so the extraction starts work instead of waiting to be re-typed.
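As a sketch of why a record beats a paragraph, the same four values can be held as typed fields, with the notice deadline computed rather than re-typed. The class, the destination names, and the write format below are illustrative assumptions. They are not the demo's code and not a real HubSpot or Salesforce API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ContractRecord:
    renewal_date: date
    auto_renew_notice_days: int
    payment_terms: str
    liability_cap_usd: int

    @property
    def notify_by(self) -> date:
        """Last day to send a non-renewal notice."""
        return self.renewal_date - timedelta(days=self.auto_renew_notice_days)


def to_writes(record: ContractRecord) -> list[dict]:
    """Map each field to a hypothetical destination system."""
    return [
        {"target": "crm.renewal_pipeline", "field": "renewal_date",
         "value": record.renewal_date.isoformat()},
        {"target": "tasks", "field": "notify_by",
         "value": record.notify_by.isoformat()},
        {"target": "billing", "field": "payment_terms",
         "value": record.payment_terms},
        {"target": "deal_desk", "field": "liability_cap_usd",
         "value": record.liability_cap_usd},
    ]


record = ContractRecord(date(2027, 3, 31), 60, "Net 45", 500_000)
writes = to_writes(record)
```

With these inputs, `record.notify_by` is 2027-01-30, the same task date shown in the table. Nobody had to work it out by hand from a chat answer.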

The working demo runs this hand-off as its final pipeline step: See this in the demo →

If you build

Sequence by riskiest assumption, not by ambition.

Your engineer builds this. My job is to sequence it by riskiest assumption, design the eval that proves each phase, and make the judgment calls on the AI design. The build option only beats the platform if it stays small and proves itself in order. This is how a real six-week engagement sequences it, and what each later phase adds.

  1. MVP · WEEKS 1-6

    Extraction with citations and confidence, landing in the CRM

    Your top two or three doc types. Every value cited to its clause, confidence scored, uncertain fields routed to a human, the record written into the CRM. Plus the evaluation set that tells you whether any of it is true.

    Riskiest assumption first: can a model read your paper accurately enough to act on? You settle that at the discovery checkpoint and prove it in the evals, not in month nine of a platform rollout.

  2. PHASE 1

    Playbook-driven risk

    Your standards encoded as rules a reviewer can read and edit. Flags say which rule fired and which clause it rests on, not just that something is risky.

  3. PHASE 2

    Redline suggestions

    The system proposes fallback language from the playbook. A lawyer accepts, edits, or rejects. The human stays the author of record.

  4. PHASE 3

    Cross-contract questions and the learning loop

    Questions that span the whole estate, such as: every agreement with a liability cap under $50K, signed in the last six months. This is the phase where a vector store finally earns its keep, and where reviewer corrections start feeding back as examples.

  5. PHASE 4

    A playbook learned from your history, optionally a fine-tuned model

    Standards inferred from what your team actually negotiated, and, if the volume justifies it, a model fine-tuned on your own corpus. That model is a moat that accrues to you, not to a vendor roadmap. It is an accelerant, never the prerequisite: extraction was already working in week six on prompts alone.

Each phase ships something a lawyer uses, and each one is a clean stopping point. If the MVP disproves the premise, you stop in week six, having spent a fraction of one platform renewal.
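To make Phase 1 concrete, here is a minimal sketch of a playbook held as plain data that a reviewer could read and edit, where each flag names the rule that fired and the clause it rests on. The rule IDs, limits, field names, and clause references are invented for illustration, not taken from any real playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    field: str
    op: str          # "min": value must not fall below limit; "max": must not exceed it
    limit: float
    description: str


@dataclass(frozen=True)
class Flag:
    rule_id: str
    clause_ref: str
    message: str


# Hypothetical standards; a real playbook comes from your lawyers.
PLAYBOOK = [
    Rule("LIAB-01", "liability_cap_usd", "min", 1_000_000,
         "Liability cap below playbook minimum"),
    Rule("PAY-01", "payment_days", "max", 60,
         "Payment terms longer than playbook maximum"),
]


def evaluate(playbook, extracted):
    """extracted maps field name -> (value, clause reference)."""
    flags = []
    for rule in playbook:
        if rule.field not in extracted:
            continue  # nothing extracted, so nothing to flag
        value, clause_ref = extracted[rule.field]
        breached = value < rule.limit if rule.op == "min" else value > rule.limit
        if breached:
            flags.append(Flag(rule.rule_id, clause_ref,
                              f"{rule.description} (found {value})"))
    return flags


extracted = {
    "liability_cap_usd": (500_000, "Section 9.2"),
    "payment_days": (45, "Section 4.1"),
}
flags = evaluate(PLAYBOOK, extracted)
```

Only `LIAB-01` fires here, and it points at Section 9.2. Changing a standard means editing one line of `PLAYBOOK`, not retraining anything.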

Notice what the MVP is not. No fine-tuning, no vector store, no migration of your templates into anyone's library. Factual extraction is mostly a prompting and evaluation problem, so the expensive machinery waits for the phase that needs it.
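A sketch of the two checks the MVP leans on, under stated assumptions: each extracted field carries a clause citation and a confidence score, low-confidence or uncited fields go to a human, and accuracy is measured against hand-labeled answers. The 0.9 threshold, the field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    clause_ref: str    # where in the contract the value was found
    confidence: float  # model-reported or calibrated score in [0, 1]


def needs_human(field: ExtractedField, threshold: float = 0.9) -> bool:
    """Route a field to review if it is uncertain or has no citation."""
    return field.confidence < threshold or not field.clause_ref


def field_accuracy(predicted: dict, expected: dict) -> float:
    """Share of labeled fields the extraction got exactly right."""
    if not expected:
        return 0.0
    correct = sum(1 for name, want in expected.items()
                  if predicted.get(name) == want)
    return correct / len(expected)


fields = [
    ExtractedField("renewal_date", "2027-03-31", "Section 12.1", 0.97),
    ExtractedField("payment_terms", "Net 30", "Section 4.1", 0.95),
    ExtractedField("liability_cap", "$500,000", "Section 9.2", 0.62),
]
auto_accepted = [f.name for f in fields if not needs_human(f)]
review_queue = [f.name for f in fields if needs_human(f)]

# Hand-labeled answers for the same contract (the evaluation set).
expected = {
    "renewal_date": "2027-03-31",
    "payment_terms": "Net 45",
    "liability_cap": "$500,000",
}
accuracy = field_accuracy({f.name: f.value for f in fields}, expected)
```

In this mock, the payment terms are wrong but confident, so routing alone would have let them through. The evaluation set is what catches that kind of miss, which is why it ships with the MVP rather than after it.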

With the use cases mapped, a small build your team owns beats both bad options. It fits your paper because it was shaped on it. It is auditable because you built the evaluation. It costs a fixed build plus cents per contract, not six figures a year. And the later phases, the learning loop and eventually a model tuned on your own contract history, are a moat that accrues to you.

None of this means buying is always wrong. Buy the commodity plumbing. And if a platform proves itself on your tail paper with published numbers, take it seriously. The point is to make that call per use case, from a map, instead of once, from fear.

The proof

See the small build, costed and measured.

The demo's “How this scales in your org” tab walks these same tiers and this same roadmap, with the live cost and accuracy numbers attached. It is the small build this article argues for, running.

Twenty synthetic contracts, no sign-up, and the accuracy numbers are published, gaps included.

Prefer to talk first? Book a 20-minute fit call.