
Chat with your documents is easy. Until you have a lot of documents.

Dušan Dević

May 1, 2026

6 min read


"Chat with your documents" stopped being a defensible product two years ago. Anyone can wire an embedding model to a vector store, drop fifty PDFs into a folder, and ship a working demo by Friday. The demo lands, the room nods, everyone agrees this is solved.

Then the corpus grows. The first ten thousand documents go in and accuracy drifts a little. The first hundred thousand, and it drifts a lot. By the time the system is the actual document store of a real company — millions of chunks across product specs, contracts, support tickets, code, knowledge base articles — the same architecture is returning answers that are confident, fluent, plausible-sounding, and wrong. Worse, the people who built it cannot tell you why.

Why scale breaks naive RAG

The intuition most people carry is "more data, more knowledge." For vector retrieval, that is the wrong intuition. Embeddings live in a fixed-dimensional space — usually 768 to 3,072 dimensions, normalized to a unit hypersphere. That space is bounded. The number of meaningfully distinct directions inside it is finite.

Push enough vectors into the same sphere and they crowd together. The cosine similarity between your true top hit and a near-miss compresses. At small scale, the right answer comes back at similarity 0.92 and the next-best at 0.71 — easy to discriminate. At large scale, the top ten results all sit between 0.83 and 0.86. The retriever has lost the resolution it needs to pick the right one. The LLM gets handed five plausible but mostly-wrong chunks and synthesizes a confident wrong answer out of them.

You can see this happening in production by plotting the gap between the #1 and #10 similarity scores over time as the corpus grows. The gap shrinks. Past a certain point, your retriever is effectively returning any ten chunks that are vaguely related to the query, and your eval scores quietly slide.
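If you want to put a number on that, one minimal sketch looks like the following. It assumes your vector store exposes some search call that returns results sorted by cosine similarity; the search function here is a hypothetical stand-in, not a specific library's API.

    import statistics

    def top1_top10_gap(queries, search, k=10):
        """Average gap between the #1 and #10 similarity scores.

        `search(query, k)` is a placeholder for whatever your vector
        store exposes; it is assumed to return (chunk_id, score) pairs
        sorted best-first by cosine similarity.
        """
        gaps = []
        for q in queries:
            hits = search(q, k)
            if len(hits) >= k:
                gaps.append(hits[0][1] - hits[k - 1][1])
        return statistics.mean(gaps) if gaps else 0.0

Run it over the same fixed eval query set every week and chart the result. A gap that was 0.2 at launch and is 0.03 now is exactly the collapse described above, even though nothing visibly broke.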

What it looks like from the user side

Users do not say "your top-k cosine similarities have collapsed." They say things like:

  • "It used to feel sharp, now it feels generic."
  • "It cited the wrong document with full confidence."
  • "It is making things up that sound right."
  • "It cannot find this exact policy I know is in there."

Each of those reports is a symptom of the same underlying problem: the vector index is no longer a good index, because at the scale of your data it cannot rank precisely enough.

What actually fixes it

Naive RAG is a single-stage system: embed the query, fetch top-k, stuff into the prompt. The fix is to stop being single-stage. The architectures that hold up at real scale share a few patterns:

  • Hybrid retrieval — combine sparse (BM25) and dense (vector) scoring with reciprocal rank fusion. Sparse catches rare exact terms; dense catches meaning. Together they discriminate where each fails alone (see the fusion sketch after this list).
  • Cross-encoder reranking — fetch 100 candidates with a fast bi-encoder, then rerank to the true top 5 with a slower cross-encoder. The reranker has the resolution the bi-encoder lost (also sketched after this list).
  • Hierarchical / agentic retrieval — first route the query to a narrow slice of the corpus (tenant, product, document class), then retrieve inside that slice. A million chunks become ten thousand become five.
  • Hard metadata filters — never retrieve across the whole index when you know the query belongs to one customer, one product, one date range. Cheap, deterministic, drastically improves precision.
  • Query rewriting and decomposition — let the model rewrite vague queries into precise ones, or split a multi-part question into independent retrievals.
  • Domain-tuned or fine-tuned embeddings — generic embedding models compress more sharply than they discriminate inside your specific domain. Fine-tuning the embedder on your own click-through data is one of the highest-ROI things a mature RAG team does.
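To make the first of those concrete, here is a minimal reciprocal rank fusion sketch. It assumes you already have two ranked lists of chunk ids, one from BM25 and one from the vector index; k=60 is the constant commonly used for RRF, not something you need to tune first.

    def reciprocal_rank_fusion(sparse_hits, dense_hits, k=60, top_n=10):
        """Fuse two best-first lists of chunk ids into one ranking."""
        scores = {}
        for hits in (sparse_hits, dense_hits):
            for rank, chunk_id in enumerate(hits, start=1):
                # Each list contributes 1 / (k + rank); chunks ranked
                # highly by both lists accumulate the largest scores.
                scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]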

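And a reranking sketch in the same spirit, using the CrossEncoder class from the sentence-transformers library. The model name is one common public checkpoint rather than a recommendation, and candidates is assumed to be the (chunk_id, text) pairs your fast first stage already returns.

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, candidates, top_n=5):
        """Score each (chunk_id, text) candidate against the query."""
        scores = reranker.predict([(query, text) for _, text in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [chunk_id for (chunk_id, _text), _score in ranked[:top_n]]

Fetching around 100 with the bi-encoder and keeping the reranker's top 5 is the usual shape; the exact numbers are a latency budget decision, not a law.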
None of these are exotic. All of them require engineering, eval pipelines, and someone who has done it before at the scale you are at. There is no library you install that handles this for you.

If your RAG felt sharp at launch and feels mediocre now, this is almost certainly what is happening.

Talk to us

We have shipped retrieval systems against corpora ranging from a few thousand documents to tens of millions of chunks, across multiple tenants and languages. We know which of the patterns above your codebase actually needs, and which are not worth the engineering. If your "chat with your documents" started strong and is quietly degrading — or you are about to launch one and want to do it right — we would like to take a look.

Book a 15-minute call or email info@deltadigit.rs. Bring your eval set, or your nagging suspicion that something is off — both work.

  • #RAG
  • #embeddings
  • #AI

// about the author

Dušan Dević

Founder · DeltaDigit. We design, build, and operate production software for ambitious teams across the EU and US.

