Insight · Build, buy, or DIY
Buy a $150K platform, rent a chatbot, or build it yourself?
The market offers a legal team two bad options. The right call depends on something most teams have never mapped: how complex each use case actually is.
You sat through the platform demo. It was genuinely impressive, and it costs $150K a year before implementation. Somewhere around minute forty it became clear that the real price was not the money. It was your team migrating templates into the vendor's clause library, reclassifying a decade of agreements, and learning to work the way the platform thinks.
Then someone in the business pointed out that a $30-a-seat chatbot “already does this.” So you tried it on a real MSA. It answered every question fluently and confidently, including the one about an indemnity carve-out that is not in the document.
Neither option fits how your team actually works. And both camps would like you to believe those are the only two choices.
The decision, mapped
Build versus buy is not one decision. It is one decision per use case.
The binary hides the real variable. Whether to buy, build, or wait turns on how complex each individual use case is, and that complexity map is exactly the thing most teams have not drawn when the vendor shows up. So they buy a platform sized for their hardest problem and use a tenth of it.
This is also where the rest of a legal team's AI worries converge. An option you cannot audit fails on traceability. One with no published accuracy fails on evaluation. One that ships your contracts somewhere opaque fails on data governance. Build versus buy is where all of those bills come due at once.
Two things are worth noticing. The buy branch is real: nobody should hand-build e-signature or document storage. And the wait branch is not defeat: deep interpretive work stays with your lawyers, deliberately, while the lower tiers earn trust.
Use-case complexity
What makes a use case hard is judgment, not page count.
A 60-page MSA can be an easy AI problem and a 2-page letter can be a hard one. The question is how much interpretation the system has to do, and each tier of judgment needs a genuinely different approach.
Low tier · Routine reading
Tasks that live here
Plain-English summaries. Standard field extraction: parties, dates, renewal terms, payment terms, liability caps.
What it genuinely needs
A strong off-the-shelf model, carefully written prompts, and an evaluation set. No training data. No fine-tuning.
Share of a typical in-house workload: the bulk of weekly volume
Medium tier · Your standards, applied
Tasks that live here
Top-5 concern identification. Clause extraction specific to your templates. Risk flags scored against your own playbook.
What it genuinely needs
Examples from your contracts, your playbook encoded as editable rules, and evaluation before anyone trusts it. Light fine-tuning later, once enough examples accrue.
Share of a typical in-house workload: a meaningful slice
High tier · Deep interpretation
Tasks that live here
Liability and arbitration interplay. Negotiation strategy. Judgment calls where a wrong answer is a lawsuit.
What it genuinely needs
SME-labeled data and genuinely custom work. Often the honest answer today is: this stays with a lawyer.
Share of a typical in-house workload: a thin slice
Most of what a legal team needs day to day sits in the low and medium tiers. Summaries, field extraction, renewal tracking, first-pass risk flags against your own playbook. That is exactly the work a small in-house build handles well, which is why the $150K platform is so often sized for a problem you mostly do not have.
The high tier is real, and it is where honesty matters most. Liability and arbitration interplay, negotiation judgment, anything where a wrong answer is a lawsuit: that work needs expert-labeled data and custom effort, and much of it should simply stay with a lawyer for now.
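To make "your playbook encoded as editable rules" concrete, here is a minimal sketch in Python. It assumes the playbook can be expressed as simple numeric comparisons; the rule IDs, field names, thresholds, and section references are all hypothetical, chosen only for illustration. The point is that a rule is data a reviewer can read and change, and that every flag names the rule that fired and the clause it rests on.

```python
from dataclasses import dataclass
import operator

# Comparison operators a rule may use, looked up by a short readable name.
OPS = {"lt": operator.lt, "gt": operator.gt, "eq": operator.eq}

@dataclass(frozen=True)
class Rule:
    rule_id: str       # hypothetical identifier, e.g. "LIAB-01"
    description: str   # plain-English standard a reviewer can read
    field: str         # extracted field the rule inspects
    op: str            # key into OPS
    limit: float       # threshold the reviewer can edit

@dataclass(frozen=True)
class Flag:
    rule_id: str
    description: str
    clause_ref: str    # clause the flagged value was extracted from

# Hypothetical playbook: illustrative standards, not recommendations.
PLAYBOOK = [
    Rule("LIAB-01", "Liability cap is below our 500,000 floor",
         "liability_cap_usd", "lt", 500_000),
    Rule("PAY-01", "Payment terms are longer than Net 45",
         "payment_terms_days", "gt", 45),
]

def evaluate(playbook, extracted):
    """Return a Flag for each rule that fires.

    `extracted` maps field name -> (numeric value, clause reference).
    Fields missing from `extracted` are skipped, not guessed.
    """
    flags = []
    for rule in playbook:
        if rule.field not in extracted:
            continue
        value, clause_ref = extracted[rule.field]
        if OPS[rule.op](value, rule.limit):
            flags.append(Flag(rule.rule_id, rule.description, clause_ref))
    return flags
```

Because the rules are plain records, changing a standard means editing one line of the playbook, not retraining anything.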
The long tail
Platforms demo on the MSA. Your risk lives in the letter agreement from 2019.
Contract platforms are, at heart, template businesses. They shine on the documents their library anticipates: the standard MSA, the standard NDA, the DPA everyone signs. Real contract estates are messier. The ad-hoc letter agreement nobody templated, the amendment to an amendment, the acquired company's entire stack on different paper.
Tail coverage is a classification problem before it is an extraction problem. A system has to recognize what kind of document it is holding, and say so plainly when it does not know, rather than forcing every page through the nearest template. That single behavior separates the three options more sharply than any pricing page.
| | $150K platform | Generic chatbot | Small fitted build |
|---|---|---|---|
| Fit to your contracts | Your templates and process bend to its clause library. | Generic. Knows nothing of your paper or your playbook. | Shaped on your doc types, your fields, your playbook. |
| Auditability | Varies. Risk scores are often a black box you cannot inspect. | None. Fluent answers, no citations, no published accuracy. | Every value cited to its clause; accuracy measured and published. |
| Who owns it | The vendor. You rent, and leaving is the expensive part. | The vendor. Your corrections improve their product, not yours. | Your team. Code, prompts, eval set, and data stay in-house. |
| Cost shape | Six figures a year, plus implementation, indefinitely. | Cheap per seat. Expensive per error nobody caught. | A fixed build cost, then cents per contract to run. |
| The long tail | Strong on templated MSAs and NDAs, weak past its library. | Reads anything, with no signal when it is out of its depth. | Classifies the doc type first and says unknown honestly. |
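The "classify first, say unknown honestly" behavior is simple to sketch. The Python below is an illustration under stated assumptions, not a production classifier: the keyword cues, the 0.5 cutoff, and the function names are hypothetical, and a real build would likely use a model in place of keyword matching. What matters is the shape: the system refuses to force a document through the nearest template when nothing matches well.

```python
from dataclasses import dataclass

# Hypothetical keyword cues per document type (illustrative only).
DOC_TYPE_CUES = {
    "msa": ["master services agreement", "statement of work"],
    "nda": ["confidential information", "non-disclosure"],
    "dpa": ["data processing", "personal data"],
}

@dataclass
class Classification:
    doc_type: str   # a known type, or "unknown"
    score: float    # fraction of that type's cues found in the text

def classify(text: str, min_score: float = 0.5) -> Classification:
    """Pick the best-matching type, but refuse to guess below min_score."""
    lowered = text.lower()
    best_type, best_score = "unknown", 0.0
    for doc_type, cues in DOC_TYPE_CUES.items():
        score = sum(cue in lowered for cue in cues) / len(cues)
        if score > best_score:
            best_type, best_score = doc_type, score
    if best_score < min_score:
        return Classification("unknown", best_score)
    return Classification(best_type, best_score)

def route(text: str) -> str:
    """Run template-specific extraction only on recognized document types."""
    result = classify(text)
    if result.doc_type == "unknown":
        return "human_review"
    return f"extract_{result.doc_type}"
```

An ad-hoc letter agreement that matches no cues goes to a person instead of being read as if it were an MSA.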
Infrastructure honesty
When a vector store earns its keep, and when it is resume-driven architecture.
Every contract-AI proposal now arrives wearing a vector database, and sometimes a knowledge graph. Single-contract extraction, the work that saves your team hours this quarter, needs neither.
A useful test for any proposal: ask which specific question the vector store answers that a simpler approach cannot. If the reply names one of your use cases, good. If the reply is “scale” or “future-proofing,” you are looking at resume-driven architecture, and you are paying for it.
Contract to cash
Extraction that lands in the CRM compounds. Extraction that dies in a chat window gets re-typed.
This is the quietest failure of the chatbot option: even when the answer is right, it has nowhere to go. Contract data is operational data. The renewal date belongs in the CRM with a task attached. The notice window belongs on a deadline. The payment terms belong next to billing, where someone can see the invoice does not match.
A small fitted build closes that loop by design, because you define where each field lands before you build the extraction. A chat window cannot, because a paragraph is not a record.
Where extraction usually dies
“The renewal date is March 31, 2027, with a 60-day non-renewal notice window, payment terms are Net 45, and the liability cap is $500,000.”
→ copied into an email · re-typed into the CRM · or nowhere
The contract-to-cash loop: the same answer as a record
| Field | Value | Where it lands |
|---|---|---|
| renewal_date | 2027-03-31 | CRM: renewal pipeline, dated |
| auto_renew_notice | 60 days | Task: notify by 2027-01-30 |
| payment_terms | Net 45 | Billing: terms reconciled |
| liability_cap | $500,000 | Deal desk: flagged for review |
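As a minimal sketch of "the same answer as a record," the Python below turns the values in the table above into a typed record and derives the notice deadline from the renewal date. The class name, field names, and destination labels are assumptions for illustration; a real build would call your CRM's and billing system's own APIs, which are not shown here.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class ContractRecord:
    renewal_date: date
    auto_renew_notice_days: int
    payment_terms: str
    liability_cap_usd: int

    @property
    def notice_deadline(self) -> date:
        # Last day to send a non-renewal notice.
        return self.renewal_date - timedelta(days=self.auto_renew_notice_days)

def plan_handoff(record: ContractRecord) -> list[tuple[str, str, str]]:
    """List (destination, field, value) writes; destinations are hypothetical labels."""
    return [
        ("crm", "renewal_date", record.renewal_date.isoformat()),
        ("tasks", "notify_by", record.notice_deadline.isoformat()),
        ("billing", "payment_terms", record.payment_terms),
        ("deal_desk", "liability_cap_usd", str(record.liability_cap_usd)),
    ]

record = ContractRecord(
    renewal_date=date(2027, 3, 31),
    auto_renew_notice_days=60,
    payment_terms="Net 45",
    liability_cap_usd=500_000,
)
```

Here `record.notice_deadline` is 2027-01-30, the same date the task row shows. A paragraph in a chat window cannot produce that date on its own; a record can.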
The working demo runs this hand-off as its final pipeline step: See this in the demo →
If you build
Sequence by riskiest assumption, not by ambition.
Your engineer builds this. My job is to sequence it by riskiest assumption, design the eval that proves each phase, and make the AI calls. The build option only beats the platform if it stays small and proves itself in order. This is how a real six-week engagement sequences it, and what each later phase adds.
MVP · WEEKS 1-6
Extraction with citations and confidence, landing in the CRM
Your top two or three doc types. Every value cited to its clause, confidence scored, uncertain fields routed to a human, the record written into the CRM. Plus the evaluation set that tells you whether any of it is true.
Riskiest assumption first: can a model read your paper accurately enough to act on? You settle that at the discovery checkpoint and prove it in the evals, not in month nine of a platform rollout.
PHASE 1
Playbook-driven risk
Your standards encoded as rules a reviewer can read and edit. Flags say which rule fired and which clause it rests on, not just that something is risky.
PHASE 2
Redline suggestions
The system proposes fallback language from the playbook. A lawyer accepts, edits, or rejects. The human stays the author of record.
PHASE 3
Cross-contract questions and the learning loop
The kind of question this phase answers: every agreement with a liability cap under $50K, signed in the last six months. This is the phase where a vector store finally earns its keep, and where reviewer corrections start feeding back as examples.
PHASE 4
A playbook learned from your history, optionally a fine-tuned model
Standards inferred from what your team actually negotiated, and, if the volume justifies it, a model fine-tuned on your own corpus. That model is a moat that accrues to you, not to a vendor roadmap. It is an accelerant, never the prerequisite: extraction was already working in week six on prompts alone.
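The MVP's rule of "every value cited to its clause, confidence scored, uncertain fields routed to a human" can be sketched in a few lines of Python. This is an illustration under assumptions: the 0.9 threshold, the field names, and the section references are hypothetical, and how the confidence score is produced is left open.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    clause_ref: Optional[str]  # where the value was found; None if uncited
    confidence: float          # 0.0 to 1.0, however the build scores it

def triage(fields, threshold: float = 0.9):
    """Split fields into (accepted, needs_review).

    A field is accepted only if it has a value, a clause citation,
    and confidence at or above the threshold. Everything else goes
    to a human, including confident values with no citation.
    """
    accepted, needs_review = [], []
    for f in fields:
        cited = f.value is not None and f.clause_ref is not None
        if cited and f.confidence >= threshold:
            accepted.append(f)
        else:
            needs_review.append(f)
    return accepted, needs_review
```

Note the design choice: a high-confidence value without a citation still goes to review, because an answer that cannot be traced to a clause cannot be audited.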
Notice what the MVP is not. No fine-tuning, no vector store, no migration of your templates into anyone's library. Factual extraction is mostly a prompting and evaluation problem, so the expensive machinery waits for the phase that needs it.
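The evaluation half of that claim is small enough to sketch. The Python below scores predicted fields against a hand-labeled evaluation set and reports accuracy per field. It is a minimal illustration assuming exact-match scoring; the function name and data layout are hypothetical, and a real eval would also normalize dates and amounts before comparing.

```python
from collections import defaultdict

def field_accuracy(gold, predicted):
    """Per-field exact-match accuracy.

    `gold` and `predicted` map contract_id -> {field_name: value}.
    A field missing from `predicted` counts as wrong, so gaps show
    up in the numbers instead of disappearing.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for contract_id, gold_fields in gold.items():
        pred_fields = predicted.get(contract_id, {})
        for field, gold_value in gold_fields.items():
            total[field] += 1
            if pred_fields.get(field) == gold_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Reporting per field, not one blended number, is what lets you say which fields are safe to act on and which still need a reviewer.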
With the use cases mapped, a small build your team owns beats both bad options. It fits your paper because it was shaped on it. It is auditable because you built the evaluation. It costs a fixed build plus cents per contract, not six figures a year. And the later phases, the learning loop and eventually a model tuned on your own contract history, are a moat that accrues to you.
None of this means buying is always wrong. Buy the commodity plumbing. And if a platform proves itself on your tail paper with published numbers, take it seriously. The point is to make that call per use case, from a map, instead of once, from fear.
The proof
See the small build, costed and measured.
The demo's “How this scales in your org” tab walks these same tiers and this same roadmap, with the live cost and accuracy numbers attached. It is the small build this article argues for, running.
Twenty synthetic contracts, no sign-up, and the accuracy numbers are published, gaps included.
Prefer to talk first? Book a 20-minute fit call.