Engineering · Dec 15, 2024 · 8 min read

Why RAG is Not Enough for 500-Page Contracts

Arjun Mehta

Co-founder & CTO

Executive Summary

Standard RAG breaks down on long documents because it loses context. We use hierarchical chunking, citation graphs, and multi-query retrieval to maintain accuracy across 500+ page contracts.

Standard RAG (Retrieval-Augmented Generation) is everywhere. Every AI startup claims to “chat with your documents.” But here’s what they don’t tell you: RAG breaks down spectacularly on long, complex documents.

We learned this the hard way.

The Problem: Context Windows Are a Lie

When a legal team asks “What is the liability cap in this MSA?”, a naive RAG system will:

  1. Chunk the document into 500-token pieces
  2. Embed each chunk
  3. Find the most similar chunks to the query
  4. Pass those chunks to an LLM

This works fine for a 10-page document. But what happens with a 500-page Master Services Agreement?
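
To make the failure mode concrete, here is a minimal sketch of that naive pipeline in plain Python. The embed and ask_llm callables are hypothetical stand-ins for whatever embedding model and LLM client you plug in; similarity is ordinary cosine over chunk vectors.

import math

def chunk(text: str, size: int = 500) -> list[str]:
    # 1. Split into fixed-size pieces (word count as a rough proxy for tokens).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def naive_rag(document: str, question: str, embed, ask_llm, top_k: int = 5) -> str:
    chunks = chunk(document)
    vectors = [embed(c) for c in chunks]                  # 2. Embed each chunk.
    q_vec = embed(question)
    ranked = sorted(zip(chunks, vectors), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:top_k])   # 3. Keep the most similar chunks.
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")  # 4. Hand them to the LLM.

Nothing in that loop knows that the highest-scoring chunk may silently depend on text 300 pages away.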

The Cross-Reference Problem

Real contracts don’t work like blog posts. A liability clause on page 87 might say:

“Notwithstanding Section 12.4, the aggregate liability shall not exceed the amounts set forth in Exhibit B, as modified by Amendment 3.”

To answer “What is the liability cap?”, your system needs to:

  • Find Section 12.4 (page 23)
  • Find Exhibit B (page 412)
  • Find Amendment 3 (a separate document)
  • Synthesize all of this into a coherent answer

Standard RAG retrieves 3-5 chunks and hopes for the best. It has no concept of document structure, cross-references, or amendments.

Our Approach: Hierarchical Document Intelligence

After months of iteration with legal teams processing thousands of contracts, we developed a different approach.

1. Hierarchical Chunking

Instead of naive fixed-size chunking, we parse documents into their natural structure:

Document
├── Front Matter (Title, Parties, Effective Date)
├── Article I: Definitions
│   ├── 1.1 "Affiliate"
│   ├── 1.2 "Confidential Information"
│   └── ...
├── Article II: Services
│   ├── 2.1 Scope
│   ├── 2.2 Service Levels
│   └── ...
└── Exhibits
    ├── Exhibit A: Pricing
    └── Exhibit B: Liability Caps

Each level gets its own embedding. When we retrieve, we pull the relevant section AND its parent context.
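
As a rough sketch of the idea (Node and retrieve_with_parents are illustrative names, not our production API), each structural node carries its own embedding and a pointer to its parent, and a retrieval hit comes back with its ancestor chain:

from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str
    embedding: list[float]
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

def parent_context(node: Node) -> list[Node]:
    # Walk up from a matched clause to the document root.
    chain, current = [], node.parent
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain

def retrieve_with_parents(nodes: list[Node], q_vec: list[float], score, top_k: int = 3):
    # `score` is any similarity function over embeddings (e.g. cosine).
    hits = sorted(nodes, key=lambda n: score(q_vec, n.embedding), reverse=True)[:top_k]
    # A hit on "2.2 Service Levels" returns with "Article II: Services" and the
    # front matter, so the LLM sees where the clause lives in the document.
    return [(hit, parent_context(hit)) for hit in hits]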

2. Citation Graphs

Every chunk maintains pointers to:

  • What it references (“See Section 12.4”)
  • What references it (other clauses that point here)
  • Related amendments and exhibits

When answering a question, we traverse this graph to gather all relevant context—even if it’s semantically distant.
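
A simplified version of that traversal is the breadth-first expansion below. The chunk IDs and edge structure are illustrative; in practice the edges are typed (references, referenced-by, amended-by) and the hop limit is tuned per document type.

from collections import deque

def expand_citations(graph: dict[str, set[str]], seeds: set[str], max_hops: int = 2) -> set[str]:
    # Expand from the retrieved chunk IDs along citation edges.
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# A hit on the liability clause from page 87 drags in Section 12.4,
# Exhibit B, and Amendment 3 via explicit cross-reference edges.
graph = {
    "clause_p87": {"section_12_4", "exhibit_b"},
    "exhibit_b": {"amendment_3"},
}
expand_citations(graph, {"clause_p87"})
# -> {'clause_p87', 'section_12_4', 'exhibit_b', 'amendment_3'}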

3. Multi-Query Retrieval

For complex questions, a single query often misses relevant chunks. We decompose questions into sub-queries:

Original: “What are our termination rights if the vendor breaches SLA?”

Decomposed:

  • “termination rights” → Article XV
  • “vendor breach” → Article XII
  • “SLA requirements” → Exhibit C
  • “breach definition” → Article I Definitions

We retrieve for each sub-query, deduplicate, and pass the union to the LLM.
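
A minimal sketch of that step, with decompose and retrieve as hypothetical stand-ins for the sub-query generator and the retriever:

def multi_query_retrieve(question: str, decompose, retrieve, top_k: int = 4) -> list[dict]:
    # decompose() turns the question into sub-queries, e.g.
    # ["termination rights", "vendor breach", "SLA requirements", "breach definition"].
    sub_queries = decompose(question)
    seen_ids: set[str] = set()
    merged: list[dict] = []
    for sub_q in sub_queries:
        for chunk in retrieve(sub_q, top_k=top_k):
            if chunk["id"] not in seen_ids:   # deduplicate across sub-queries
                seen_ids.add(chunk["id"])
                merged.append(chunk)
    return merged                             # the union becomes the LLM's context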

Results: 72% → 94% Citation Accuracy

On our internal benchmark of 200 legal questions across 50 contracts:

Metric                        Naive RAG    Our Approach
Citation Accuracy             72%          94%
Cross-Reference Resolution    31%          89%
Latency (p95)                 2.1s         3.4s

Yes, we trade some latency for accuracy. For legal teams, that’s the right trade.

What This Means for Enterprise Document AI

If you’re evaluating document AI solutions, ask these questions:

  1. How do you handle cross-references? If the answer is “we chunk and embed,” run.
  2. Can you show me the exact source? Vague answers mean hallucinations.
  3. What’s your accuracy on 500+ page documents? Short demos hide long-document failures.

Try It Yourself

We built Briefly to handle exactly these challenges. Upload a complex contract and ask questions that require cross-referencing.

Book a Demo →


Have questions about our architecture? Reach out at team@briefly-docs.com.

Frequently Asked Questions

Why does RAG fail on long documents?
RAG retrieves chunks based on semantic similarity, but long documents have complex cross-references and context dependencies that get lost when split into small chunks. A clause on page 450 might reference definitions from page 5.
How does Briefly handle long documents differently?
We use hierarchical chunking (document → section → paragraph), maintain a citation graph for cross-references, and use multi-query retrieval to gather context from related sections before generating answers.
What accuracy improvements did you see?
On our legal document benchmark, hierarchical chunking improved citation accuracy from 72% to 94% compared to naive RAG approaches.

Tags

rag document-ai legal-tech enterprise

Arjun Mehta

Co-founder & CTO

Building the accuracy layer for high-stakes document workflows at Briefly Docs.

Ready to try Briefly?

See how we handle your most complex documents.

Book a Demo