Insight · Data governance

Where does your contract data actually go?

Counterparty PII and commercial secrets travel with every contract, so the first test of any legal AI tool is whether it can answer, precisely, where the document goes.

The vendor demo is going well. The tool reads a contract, fields appear, the room nods. Then someone from security asks where the document went, and the energy changes. “It’s processed by AI” turns out to be the whole answer. Which model? Retained for how long? Who else touches it? Silence, and a promise to follow up.

The discomfort is earned. A contract is one of the densest stores of sensitive data your company holds: counterparty names and signature blocks, notice contacts with direct emails and phone numbers, negotiated pricing, indemnity positions, termination terms a competitor would happily read. “We send it to an AI” is a data transfer to a third party. It deserves the same scrutiny as any other.

Four questions should gate every tool, before anyone debates accuracy:

Which model, exactly?

A named provider and model version, not “enterprise-grade AI.”

What is retained, and for how long?

Inputs and outputs both, stated in writing.

Who is the processor of record?

And every sub-processor sitting behind them.

On a breach, who calls whom, and when?

Your 72-hour regulator clock starts regardless.

If a vendor cannot answer all four in plain language, the evaluation should end there. Data governance is not a checkbox at the end of a legal AI project. It is the gate at the front, and in practice it is what decides whether your CIO and CTO sign off at all. A tool can be impressively accurate and still be unshippable, because accuracy can be improved after launch and a data flow nobody will approve cannot.

This question belongs to a set. The vendor who cannot name their processor usually cannot show which clause a number came from either (the traceability problem), and a tool that hides its data path tends to hide its uncertainty too (the honesty problem). Residency is also the first half of the compliance conversation: the second half, where AI at contract review lands under the EU AI Act, is calmer than most teams fear. Residency comes first because it decides viability. Everything else in this series assumes you can get past this gate.

Redact first

Strip the personal data before the model ever sees it

A contract carries two kinds of sensitive content, and they deserve different treatment. The commercial terms (the caps, the dates, the obligations) are the point: the model has to see them to extract them. The personal data is not. Signatory and notice-contact names, direct emails, phone numbers, street addresses, the occasional ID-shaped number: none of it is needed for the work. A liability cap does not depend on who the notice contact is.

So a well-built pipeline strips personal data before a single character leaves your side, replacing each value with a stable token: [PERSON-1], [EMAIL-1]. The document’s structure survives, extraction proceeds exactly as before, and the model provider never receives a name to retain, leak, or produce under subpoena.

One unfashionable detail matters here. The redaction pass should be boring: a rules layer your own engineer can read line by line, not a clever model-based redactor. Redaction is precisely the step an auditor will ask about, and you cannot certify what you cannot inspect. A production system handling messier documents adds a trained named-entity pass on top. That is an accuracy upgrade, not a new architecture, and it still runs on your side of the line.

The boundary is the whole game: the document travels from your network through a redaction gate, and only masked text crosses the line to a hosted model. Everything that identifies a person stays on your side. Here is what the gate does to a sample clause from a synthetic contract, using the same rule family and tokens as step 1 of the working demo.

Fig. 1 · The redaction diff, before and after

Notices clause · synthetic MSA · page 12

Notices to Provider shall be sent to Meridian Health Systems, Inc., Attn: Daniel Okafor, VP Commercial, at 2200 Bayshore Boulevard, Suite 400, San Francisco, CA 94124, with a copy by email to d.okafor@meridianhealth.com and by phone to (415) 555-0182 within five (5) business days of the triggering event.

4 PII instances removed before any model call:

Daniel Okafor → [PERSON-1]
2200 Bayshore Boulevard, Suite 400 → [ADDRESS-1]
d.okafor@meridianhealth.com → [EMAIL-1]
(415) 555-0182 → [PHONE-1]

Company names stay: legal entities are the extraction target, not personal data. People, emails, phones, and street addresses go. Same rule family as step 1 of the working demonstration, which stores nothing you do there.
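To make the "boring rules layer" concrete, here is a minimal sketch that reproduces the Fig. 1 diff on the sample clause. It is not the demo's actual rule set: the patterns, the `RULES` list, and the `redact` function are hypothetical, written only to cover the shapes in this one clause. A real rule set would need much broader coverage and review.

```python
import re
from collections import defaultdict

# Hypothetical rules, written only to cover the shapes in the sample clause.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")),
    ("ADDRESS", re.compile(
        r"\d+\s+(?:[A-Z][a-z]+\s+)+(?:Boulevard|Street|Avenue|Road|Drive)"
        r"(?:,\s*Suite\s+\d+)?"
    )),
    # Names are matched only after an "Attn:" cue, so company names survive.
    ("PERSON", re.compile(r"(?<=Attn:\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")),
]

def redact(text):
    """Replace each matched value with a stable token; return masked text and mapping."""
    mapping = {}                 # original value -> token; stays on your side
    counters = defaultdict(int)  # running index per token label

    def make_replacer(label):
        def replace(match):
            value = match.group(0)
            if value not in mapping:  # the same value always gets the same token
                counters[label] += 1
                mapping[value] = f"[{label}-{counters[label]}]"
            return mapping[value]
        return replace

    for label, pattern in RULES:
        text = pattern.sub(make_replacer(label), text)
    return text, mapping

CLAUSE = (
    "Notices to Provider shall be sent to Meridian Health Systems, Inc., "
    "Attn: Daniel Okafor, VP Commercial, at 2200 Bayshore Boulevard, Suite 400, "
    "San Francisco, CA 94124, with a copy by email to d.okafor@meridianhealth.com "
    "and by phone to (415) 555-0182 within five (5) business days of the "
    "triggering event."
)

masked, mapping = redact(CLAUSE)
print(masked)
for original, token in mapping.items():
    print(f"{original} -> {token}")
```

Only `masked` would cross the boundary; `mapping` never leaves your side, so the tokens can be resolved back to real values once results return. As in the figure, the company name, city, and ZIP code are left in place, and every rule is a line an auditor can read.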

Want to watch the redaction run before anything leaves? See this in the demo →

Roles and retention

You stay the controller. The vendor is your processor.

GDPR sorts everyone who touches personal data into two roles. The controller decides why and how personal data is processed. The processor handles it on the controller’s documented instructions. When your contracts flow to a model API, your company is the controller, and the tool vendor and model provider are processors and sub-processors. Sending data to a vendor outsources the work, never the responsibility. If a processor leaks counterparty data, the regulator’s first call is still to you.

Concretely, the split looks like this:

Who owes what: the full controller and processor duties

Purpose and lawful basis

You · the controller: Decide why contract data is processed and under what lawful basis. This cannot be delegated, whatever the order form implies.
Your vendor · the processor: Process the data only on your documented instructions, for your purpose, and none of their own.

What goes in

You · the controller: Control what enters the pipeline. Redaction before the model call is your control to demand, not a courtesy to hope for.
Your vendor · the processor: Use nothing beyond the engagement. No training on your inputs unless you agreed to it in writing.

Security and sub-processors

You · the controller: Vet before data flows: DPA signed, sub-processor list reviewed, a lawful transfer mechanism in place for data leaving the EEA.
Your vendor · the processor: Maintain technical measures: encryption in transit and at rest, access controls, audit logs. Notify you before sub-processors change.

Breach response

You · the controller: Notify the regulator within 72 hours, and affected individuals where required. The clock runs on you, not the vendor.
Your vendor · the processor: Tell you without undue delay after becoming aware, with enough detail for you to assess and act.

Retention and deletion

You · the controller: Set the retention rule and verify it is honored. Storing nothing you do not need is the cheapest control available.
Your vendor · the processor: Delete or return the data at the end of the engagement, and demonstrate that it happened.
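The 72-hour duty in the breach-response row is simple arithmetic, and worth wiring into an incident runbook. The sketch below assumes the window is counted from the moment your own team becomes aware of the breach, which counsel should confirm for your situation. The function names and the timestamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone

REGULATOR_WINDOW = timedelta(hours=72)  # the window cited in the text

def regulator_deadline(controller_aware_at):
    """Latest notification time, counted from the controller's awareness (assumption)."""
    return controller_aware_at + REGULATOR_WINDOW

def hours_remaining(controller_aware_at, now):
    """Hours left on the clock at `now`; negative once the deadline has passed."""
    remaining = regulator_deadline(controller_aware_at) - now
    return remaining.total_seconds() / 3600

# Illustrative timestamp, not taken from any real incident.
aware = datetime(2025, 3, 3, 9, 0, tzinfo=timezone.utc)
deadline = regulator_deadline(aware)
left = hours_remaining(aware, aware + timedelta(hours=30))
print(deadline.isoformat())
print(left)
```

Timezone-aware timestamps keep the deadline unambiguous when the vendor, your team, and the regulator sit in different time zones.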

Retention: get it in writing

Most major model providers now offer zero-retention or short-window options for API traffic. The defaults vary, the controls move, and the marketing page is not the contract. What you rely on must live in the data processing agreement (DPA), not in a reassuring email from the account manager. Four demands, none of them exotic:

Demand in the DPA, not in an email

A stated retention window for inputs and outputs: zero, or a number of days. “As long as needed” is not a window.
No training on your data, written as an obligation, not buried as a default setting someone could change.
Deletion or return of the data at termination, with written confirmation that it happened.
Breach notification without undue delay, to a named contact, with enough detail for you to act on your own 72-hour clock.
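The four demands above can be recorded as a small checklist your procurement or security team fills in from the signed DPA. This is a sketch under assumptions: the `DpaTerms` fields and the `dpa_gaps` function are hypothetical names, and the values come from a person reading the agreement, not from parsing it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DpaTerms:
    """Hypothetical record of what the signed DPA actually says."""
    retention_days: Optional[int]        # 0 = zero retention; None = no stated window
    training_prohibited: bool            # written as an obligation, not a default setting
    deletion_confirmed_in_writing: bool  # deletion or return at termination, confirmed
    breach_contact: Optional[str]        # named contact for breach notices

def dpa_gaps(terms):
    """Return the demands the DPA fails to meet; an empty list means all four are covered."""
    gaps = []
    if terms.retention_days is None or terms.retention_days < 0:
        gaps.append("no stated retention window for inputs and outputs")
    if not terms.training_prohibited:
        gaps.append("no written obligation against training on your data")
    if not terms.deletion_confirmed_in_writing:
        gaps.append("no written confirmation of deletion or return at termination")
    if not terms.breach_contact:
        gaps.append("no named contact for breach notification")
    return gaps

# "As long as needed" is not a window, so it is recorded as None.
vague = DpaTerms(retention_days=None, training_prohibited=True,
                 deletion_confirmed_in_writing=True, breach_contact=None)
solid = DpaTerms(retention_days=0, training_prohibited=True,
                 deletion_confirmed_in_writing=True,
                 breach_contact="privacy@example.com")
print(dpa_gaps(vague))
print(dpa_gaps(solid))
```

The function returns the list of gaps instead of a pass/fail flag, so the output can go straight back to the vendor as the list of terms still missing.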

CCPA, in one paragraph

California frames the same relationship as “business” and “service provider.” If contract data includes Californians’ personal information, your vendor needs to sign service-provider terms: use limited to the agreed purpose, no selling or sharing, cooperation with deletion requests. A vendor that will not sign is, in CCPA terms, a third party you are disclosing personal information to, which changes your notice obligations and your exposure. The diligence test takes one sentence: ask whether they will sign CCPA service-provider terms, and watch how fast the answer comes.

Deployment

What an on-prem requirement actually changes

Some boards answer the residency question with a wall: nothing leaves the network. That is a legitimate posture, and it is buildable. Be clear-eyed about what it changes, because vendors mumble here. The frontier hosted models are off the table. On your own hardware you run open-weight models, usually smaller, often quantized: compressed to fit on affordable GPUs, at some cost to precision.

Smaller model, lower accuracy on the hard extractions. That is not an argument against on-prem; it is a design input. An on-prem system leans harder on confidence gating and the human review queue: more fields routed to a person, fewer trusted automatically. The honest framing for your CIO is a dial, not a binary. Each step toward full control trades away some accuracy and adds some operations. A vendor who tells you on-prem “works exactly the same” has not measured the difference.
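One way to picture "leans harder on confidence gating" is a per-deployment threshold for trusting a field automatically. The sketch below is illustrative only: the text gives no numbers, so the thresholds, field names, values, and confidence scores are all hypothetical. The only assumption carried over is that a weaker model warrants a stricter bar.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from measuring each
# deployment against a labeled set of contracts.
AUTO_ACCEPT_THRESHOLD = {
    "hosted_api": 0.90,
    "private_cloud": 0.93,
    "on_prem": 0.97,
}

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # model-reported score in [0, 1] (hypothetical)

def route(fields, deployment):
    """Split fields into auto-accepted and human-review lists for a deployment."""
    threshold = AUTO_ACCEPT_THRESHOLD[deployment]
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review

fields = [
    ExtractedField("liability_cap", "USD 2,000,000", 0.98),
    ExtractedField("termination_notice", "90 days", 0.94),
    ExtractedField("governing_law", "Delaware", 0.80),
]

for deployment in AUTO_ACCEPT_THRESHOLD:
    accepted, review = route(fields, deployment)
    print(deployment, [f.name for f in accepted], [f.name for f in review])
```

With the same three fields, the hosted setting trusts two automatically and the on-prem setting trusts one, which is the trade the dial describes: more fields routed to a person, fewer trusted automatically.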

Fig. 2 · The residency dial: three deployments, three trades

Most accuracy, least to operate → Most control

Hosted model API

Where the document goes: Redacted text leaves your network to a vetted processor, under a DPA you negotiated.
What you can run: Frontier models. The strongest extraction accuracy available.
The trade: Control lives in the contract: retention window, training exclusion, sub-processor list.

Private cloud (VPC)

Where the document goes: Documents stay inside your cloud tenancy. The model comes to the data.
What you can run: Large open-weight models on GPUs you rent.
The trade: More control, more infrastructure to operate, a modest accuracy cost.

Fully on-prem

Where the document goes: Nothing leaves the building. No external processor in this flow at all.
What you can run: Smaller or quantized open-weight models on hardware you own.
The trade: Maximum control. A real accuracy cost, so the review queue carries more weight.

In practice

Done wrong, done right

Strip away the acronyms and the residency question reduces to two postures you will recognize from vendor conversations.

Done wrong

Raw contracts uploaded to a tool nobody vetted. No redaction pass. No DPA, no sub-processor list, retention “probably fine.” Every upload is an unlogged disclosure of counterparty personal data, and the first time anyone can say where the documents went is during the breach post-mortem.

Done right

Personal data stripped before any model call, by a pass your own team can read. A signed DPA naming the processor and its sub-processors. A retention window you chose, training use excluded in writing, deletion confirmed at exit. Nothing stored that the workflow does not need.

The proof

Watch the personal data get stripped first.

Redaction before the model is step 1 of the working demonstration on this site, running before classification, extraction, or anything downstream. Pick a synthetic contract and watch the before-and-after diff, every replaced value listed. The demonstration also keeps its own residency promise: it stores nothing you do there.

See an example: extracting key terms, done right

Twenty synthetic contracts, no sign-up, and nothing you do in the demonstration is stored.
