Executive Summary
Standard RAG breaks down on long documents because it loses context. We use hierarchical chunking, citation graphs, and multi-query retrieval to maintain accuracy across 500+ page contracts.
Standard RAG (Retrieval-Augmented Generation) is everywhere. Every AI startup claims to “chat with your documents.” But here’s what they don’t tell you: RAG breaks down spectacularly on long, complex documents.
We learned this the hard way.
The Problem: Context Windows Are a Lie
When a legal team asks “What is the liability cap in this MSA?”, a naive RAG system will:
- Chunk the document into 500-token pieces
- Embed each chunk
- Find the most similar chunks to the query
- Pass those chunks to an LLM
This works fine for a 10-page document. But what happens with a 500-page Master Services Agreement?
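Before digging into where it breaks, here is a minimal sketch of that naive pipeline. The chunker and cosine-similarity search below are illustrative only: the embedding model is assumed to be supplied separately, and none of this reflects our production stack.

```python
import numpy as np

def chunk_fixed(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking (here by whitespace tokens, not a real tokenizer)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_k(chunks: list[str], chunk_vecs: np.ndarray, query_vec: np.ndarray, k: int = 5) -> list[str]:
    """Cosine-similarity search over precomputed chunk embeddings."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# The top-k chunks are then stuffed into a prompt and sent to the LLM.
```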
The Cross-Reference Problem
Real contracts don’t work like blog posts. A liability clause on page 87 might say:
“Notwithstanding Section 12.4, the aggregate liability shall not exceed the amounts set forth in Exhibit B, as modified by Amendment 3.”
To answer “What is the liability cap?”, your system needs to:
- Find Section 12.4 (page 23)
- Find Exhibit B (page 412)
- Find Amendment 3 (a separate document)
- Synthesize all of this into a coherent answer
Standard RAG retrieves 3-5 chunks and hopes for the best. It has no concept of document structure, cross-references, or amendments.
Our Approach: Hierarchical Document Intelligence
After months of iteration with legal teams processing thousands of contracts, we developed a different approach.
1. Hierarchical Chunking
Instead of naive fixed-size chunking, we parse documents into their natural structure:
```
Document
├── Front Matter (Title, Parties, Effective Date)
├── Article I: Definitions
│   ├── 1.1 "Affiliate"
│   ├── 1.2 "Confidential Information"
│   └── ...
├── Article II: Services
│   ├── 2.1 Scope
│   ├── 2.2 Service Levels
│   └── ...
└── Exhibits
    ├── Exhibit A: Pricing
    └── Exhibit B: Liability Caps
```
Each level gets its own embedding. When we retrieve, we pull the relevant section AND its parent context.
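As a rough sketch of how parent context can travel with a retrieved section, the structure below keeps a pointer from each node to its parent and builds a heading breadcrumb. The names and the expansion policy are illustrative assumptions, not our exact implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SectionNode:
    """One node in the parsed hierarchy: an article, section, or exhibit."""
    node_id: str
    title: str
    text: str
    parent: Optional["SectionNode"] = None
    children: list["SectionNode"] = field(default_factory=list)

    def heading_path(self) -> str:
        """Breadcrumb like 'Document > Article I: Definitions > 1.1 "Affiliate"'."""
        node, parts = self, []
        while node is not None:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))

def expand_with_parent(hit: SectionNode) -> str:
    """Return a retrieved section plus its heading path and its parent's text,
    so the model sees the section in its structural context."""
    parent_text = hit.parent.text if hit.parent else ""
    return f"{hit.heading_path()}\n{parent_text}\n{hit.text}".strip()
```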
2. Citation Graphs
Every chunk maintains pointers to:
- What it references (“See Section 12.4”)
- What references it (other clauses that point here)
- Related amendments and exhibits
When answering a question, we traverse this graph to gather all relevant context—even if it’s semantically distant.
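A hedged sketch of that traversal: starting from the initially retrieved chunks, a breadth-first walk follows both outgoing and incoming references up to a fixed number of hops. The two adjacency maps and the hop limit are assumptions for illustration, not a description of our internal data model.

```python
from collections import deque

def gather_citation_context(seed_ids: list[str],
                            references: dict[str, list[str]],
                            referenced_by: dict[str, list[str]],
                            max_hops: int = 2) -> set[str]:
    """Collect chunk ids reachable from the retrieved chunks by following
    outgoing references ("See Section 12.4") and incoming ones."""
    seen = set(seed_ids)
    frontier = deque((cid, 0) for cid in seed_ids)
    while frontier:
        chunk_id, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in references.get(chunk_id, []) + referenced_by.get(chunk_id, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen

# e.g. seed_ids=["sec-12.4"], references={"sec-12.4": ["exhibit-B"]},
#      referenced_by={"sec-12.4": ["amendment-3"]}  (hypothetical ids)
```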
3. Multi-Query Retrieval
For complex questions, a single query often misses relevant chunks. We decompose questions into sub-queries:
Original: “What are our termination rights if the vendor breaches SLA?”
Decomposed:
- “termination rights” → Article XV
- “vendor breach” → Article XII
- “SLA requirements” → Exhibit C
- “breach definition” → Article I Definitions
We retrieve for each sub-query, deduplicate, and pass the union to the LLM.
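Sketched in code, the merge step looks roughly like this. `decompose` and `retrieve` stand in for an LLM-based query decomposer and the retriever described above; both are illustrative placeholders rather than real APIs.

```python
def multi_query_retrieve(question: str, decompose, retrieve, k: int = 5) -> list[str]:
    """Run retrieval once per sub-query, then deduplicate the results.
    `decompose`: question -> list of sub-queries (e.g. via an LLM prompt).
    `retrieve`: query -> ranked chunk ids."""
    merged, seen = [], set()
    for query in [question] + decompose(question):   # keep the original query too
        for chunk_id in retrieve(query, k=k):
            if chunk_id not in seen:                  # dedupe across sub-queries
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged                                     # the union goes to the LLM
```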
Results: 72% → 94% Citation Accuracy
On our internal benchmark of 200 legal questions across 50 contracts:
| Metric | Naive RAG | Our Approach |
|---|---|---|
| Citation Accuracy | 72% | 94% |
| Cross-Reference Resolution | 31% | 89% |
| Latency (p95) | 2.1s | 3.4s |
Yes, we trade some latency for accuracy. For legal teams, that’s the right trade.
What This Means for Enterprise Document AI
If you’re evaluating document AI solutions, ask these questions:
- How do you handle cross-references? If the answer is “we chunk and embed,” run.
- Can you show me the exact source? Vague answers mean hallucinations.
- What’s your accuracy on 500+ page documents? Short demos hide long-document failures.
Try It Yourself
We built Briefly to handle exactly these challenges. Upload a complex contract and ask questions that require cross-referencing.
Have questions about our architecture? Reach out at team@briefly-docs.com.
Frequently Asked Questions
Why does RAG fail on long documents?
Naive chunking and top-k retrieval have no notion of document structure, so cross-references, exhibits, and amendments get lost between chunks.

How does Briefly handle long documents differently?
We combine hierarchical chunking, citation graphs, and multi-query retrieval so answers are assembled from the full referenced context, not just the most similar chunks.

What accuracy improvements did you see?
On our internal benchmark of 200 legal questions across 50 contracts, citation accuracy rose from 72% to 94% and cross-reference resolution from 31% to 89%.
Arjun Mehta
Co-founder & CTO
Building the accuracy layer for high-stakes document workflows at Briefly Docs.