RAG · Retrieval · Search

RAG is a Search Problem. Build It Like One.

A blueprint for retrieval systems that survive production.

May 15, 2026 · 8 min read · Louis Ulmer

Most teams reach for a framework, inherit its opinions, and spend the next six months fighting them. This is a blueprint for the opposite approach: a small system you fully understand, where every retrieval decision is yours to make, and yours to measure.

Seven sheets. No framework required.

1 - The framework tax

The first RAG demo is easy, and frameworks make it easier. The hundredth answer is the hard part, and the question production keeps asking is brutally simple: when a result is wrong, can you find out why, and can you change it?

With most RAG frameworks the honest answer is no. The retrieval logic is buried under abstractions tuned for a different model than yours, usually a hosted one. Swap in an open-weights model and the chain quietly degrades in ways you can’t see. Add a feature and you find yourself reading framework internals instead of writing your own code. You are also at the mercy of framework updates.

None of this is an argument against good libraries. Use a vector store. Use an embedding server. Use a battle-tested HTTP client. The argument is narrower and sharper: don’t hand away the one part of the system that decides whether it works, which is retrieval.

Rule of thumb. Own your retrieval; rent everything else. The best framework, for the part that matters, is no framework.

2 - The blueprint

Fig. 1 — Two loops meet at one index; the language model appears once, at the very end.

Strip RAG down and only two loops remain.

Offline, you turn documents into something searchable: chunk them, enrich them, and write them into an index.
Online, you turn a question into a ranked set of passages, then into an answer.

The two loops meet at exactly one place: the index.

Notice where the language model sits: once, at the very end. By the time it sees a token, every decision that determines answer quality has already been made upstream. Generation is close to a commodity now. Retrieval is where systems are won and lost, so that is where the rest of this blueprint lives.
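
The two loops can be sketched in a few lines. This is a minimal illustration, not a reference design: the function names and the toy stand-ins for chunking, retrieval, and generation are all assumptions made for the example.

```python
def build_index(documents, chunk, enrich):
    """Offline loop: documents -> chunks -> enriched records in an index."""
    index = []
    for doc in documents:
        for piece in chunk(doc):
            index.append(enrich(piece, doc))
    return index


def answer(question, index, retrieve, generate, top_k=3):
    """Online loop: question -> ranked passages -> answer."""
    passages = retrieve(question, index)[:top_k]
    # The only place a language model is called.
    return generate(question, passages)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_chunk(doc):
    return [s.strip() for s in doc.split(".") if s.strip()]

def toy_enrich(piece, doc):
    return {"text": piece, "source_length": len(doc)}

def toy_retrieve(question, index):
    words = set(question.lower().split())
    overlap = lambda r: len(words & set(r["text"].lower().split()))
    return sorted(index, key=overlap, reverse=True)

def toy_generate(question, passages):
    return f"Based on: {passages[0]['text']}"

index = build_index(["Refunds take 14 days. Invoices are monthly."], toy_chunk, toy_enrich)
reply = answer("how long do refunds take", index, toy_retrieve, toy_generate)
```

Every stand-in here is replaceable without touching the two loops, which is the point of keeping them this small.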

3 - Retrieval is two kinds of matching

A question matches a passage by meaning, or by words. You need both.

Dense embeddings capture meaning: “how do I reset my credentials” finds a passage about “rotating API keys” even with no shared words. That generalization is also their weakness, because embeddings blur exact tokens. Names, error codes, SKUs, the literal string RAFT (the things a user types when they know precisely what they want) are where dense search misses.

Sparse, lexical matching is the complement. It rewards exact terms. There’s a ladder of maturity, and you can climb as far as your latency budget allows:

Model	What it is	When
TF-IDF	Term counts weighted by rarity.	The baseline: old, cheap, simple.
BM25	TF-IDF with saturation and length normalization. Still the workhorse of lexical search, superb at rare identifiers.	Worthwhile improvement.
SPLADE	A transformer that predicts a sparse vocabulary vector: learned lexical matching with term expansion, so “k8s” can light up “kubernetes.”	When BM25 plateaus.
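
To make the middle rung concrete, here is a small BM25 scorer. It is a sketch of the textbook formula under common assumptions: k1=1.5 and b=0.75 are conventional defaults rather than tuned values, the "+1" inside the logarithm is one of several IDF variants, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rarity weight; this variant never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates repeated terms, b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "rotate api keys from the settings page".split(),
    "error e1042 means the api key expired".split(),
    "billing invoices are sent monthly".split(),
]
scores = bm25_scores(["e1042"], docs)
```

Only the second document scores above zero for the rare identifier, which is exactly the behavior dense search tends to blur.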

Run dense and sparse in parallel, then merge with Reciprocal Rank Fusion: ignore the raw scores, combine by rank position. RRF is robust precisely because it refuses to trust two incomparable score scales. Tune the weighting. A sensible starting point is roughly 80% semantic, 20% lexical, then move it with your own numbers.
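
A weighted form of RRF fits in a dozen lines. In this sketch, k=60 is the constant commonly used with RRF, and applying the 80/20 split as per-list weights on the rank contributions is an assumption about how to express that starting point; the document IDs are made up.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked ID lists by rank position only; raw scores are never used."""
    fused = defaultdict(float)
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += weights[name] / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


rankings = {
    "dense": ["a", "b", "c"],   # best first, from the embedding search
    "sparse": ["c", "d", "a"],  # best first, from the lexical search
}
fused = weighted_rrf(rankings, {"dense": 0.8, "sparse": 0.2})
```

Here "c" overtakes "b" because it appears in both lists, even though the dense retriever alone ranked it last.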

The payoff is not subtle. Anthropic reported that prepending a short, document-aware context to each chunk and adding contextual BM25 cut the top-20 retrieval failure rate by 49%; layering a reranker on top brought the reduction to 67%. Hybrid isn’t a nicety. It closes the gap two methods leave open in different places.

4 - Rank it further

Fig. 2 — The set shrinks left to right while cost per item climbs; spend compute where the set is small.

Recall is cheap. Precision is worth paying for.

First-stage retrieval is recall-oriented and fast: a bi-encoder embeds the query and every passage separately, then compares vectors. It’s fast because the passages were embedded long ago, but it never lets the query and a passage truly look at each other.

A reranker does. A cross-encoder reads the query and a candidate passage together, in one forward pass, and scores their relevance with full attention between them. It’s far more expensive per pair, which is exactly why you don’t run it on the whole corpus. The pattern: retrieve ~150 cheaply, rerank down to the 5–10 you actually show the model.
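
The retrieve-then-rerank pattern is a two-stage funnel. The sketch below shows only the control flow: `cheap_score` and `pair_score` are hypothetical callables, and the toy scorers stand in for a real bi-encoder and cross-encoder.

```python
def retrieve_then_rerank(query, passages, cheap_score, pair_score,
                         n_retrieve=150, n_keep=5):
    # Stage 1: recall-oriented, cheap per item, runs over everything.
    shortlist = sorted(passages, key=lambda p: cheap_score(query, p),
                       reverse=True)[:n_retrieve]
    # Stage 2: precision-oriented, expensive per pair, runs on the shortlist only.
    reranked = sorted(shortlist, key=lambda p: pair_score(query, p), reverse=True)
    return reranked[:n_keep]


def word_overlap(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

def toy_pair_score(query, passage):
    # Stand-in for a cross-encoder: rewards the query appearing as an exact phrase.
    return float(query.lower() in passage.lower()) + 0.1 * word_overlap(query, passage)

passages = [
    "reset your password in settings",
    "password reset is not available offline",
    "how to reset password: open settings",
    "invoices are monthly",
]
top = retrieve_then_rerank("reset password", passages, word_overlap,
                           toy_pair_score, n_retrieve=3, n_keep=1)
```

The expensive scorer is called at most `n_retrieve` times per query, no matter how large the corpus grows.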

Late interaction sits between the two. ColBERT keeps a vector per token and scores with MaxSim: most of the precision of a cross-encoder, much of the speed of a bi-encoder. The idea now reaches past text. ColPali ranks document pages as images, and listwise LLM rerankers (Qwen3, jina-reranker-v3) read a query against many documents in one shared context window. Different points on the same speed-versus-precision curve, so pick by your latency budget.
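
MaxSim itself is simple: for each query token, take its best-matching document token, then sum. A minimal sketch in plain Python, assuming the token vectors are already L2-normalized so a dot product acts as cosine similarity; the two-dimensional vectors are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # matches the first token exactly
doc_b = [[0.6, 0.8]]               # a single, partially matching token

score_a = maxsim(query, doc_a)     # 1.0 + 0.8
score_b = maxsim(query, doc_b)     # 0.6 + 0.8
```

Because document token vectors are computed offline, only the query side and the max-and-sum run at query time.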

5 - Make documents findable before you search them

Most “bad search” is bad indexing wearing a disguise.

A retriever can only return what you indexed well. Before you tune a single ranking parameter, look at what’s in the index; the failure is usually there.

Chunk on meaning, not on a fixed count of characters or tokens. A fixed 512-token window will happily slice a table in half or strand a definition from the sentence that uses it. Semantic chunking respects structure, so each chunk is a self-contained idea.
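
As a simplified stand-in for semantic chunking, the sketch below packs whole paragraphs into chunks and never cuts inside one. Treating a blank line as the structural boundary and capping chunks at `max_words` are assumptions for the example; a real chunker would also respect headings, tables, and lists.

```python
def chunk_by_paragraph(text, max_words=200):
    """Group whole paragraphs into chunks of at most `max_words` where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        # Close the current chunk rather than split a paragraph across two.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "A b c.\n\nD e f.\n\nG h i."
chunks = chunk_by_paragraph(text, max_words=6)
```

A paragraph longer than `max_words` becomes its own oversized chunk here instead of being cut, which is a deliberate trade of uniform size for intact meaning.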

Give the model a job at index time. It’s cheap offline and compounding online: have it write a one-line description of each chunk in the context of its parent document, extract a field of keywords, and generate metadata. A chunk that begins “From the Q3 refund-policy memo:” is dramatically easier to retrieve than the same orphaned paragraph.
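
Index-time enrichment might look like the sketch below. `describe` and `extract_keywords` are hypothetical callables that would wrap your own model calls; the fakes exist only so the example runs, and the record layout is an assumption.

```python
def enrich_chunk(chunk_text, document_text, describe, extract_keywords):
    """Attach a document-aware context line and keywords to one chunk."""
    context = describe(document_text, chunk_text)
    return {
        "text": chunk_text,
        "context": context,
        "keywords": extract_keywords(chunk_text),
        # What actually gets embedded and indexed: context first, then the chunk.
        "index_text": f"{context}\n{chunk_text}",
    }


# Fakes standing in for model calls (hypothetical).
def fake_describe(document_text, chunk_text):
    return "From the Q3 refund-policy memo:"

def fake_keywords(chunk_text):
    words = [w.strip(".,").lower() for w in chunk_text.split()]
    return sorted({w for w in words if len(w) > 6})

record = enrich_chunk(
    "Refunds are issued within 14 days of a return.",
    "Q3 refund-policy memo (full text)",
    fake_describe,
    fake_keywords,
)
```

Keeping the original `text` next to `index_text` lets you show the model the clean passage while searching over the enriched one.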

Then filter before you search. Store that metadata as structured fields, and let the model translate a question into a boolean expression over them, like type=invoice AND year>=2024, applied before vector search. Narrow the haystack first; the needle gets a lot easier to find.
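
A pre-filter can be sketched as a list of AND-ed conditions applied before any vector comparison. The record layout, the condition format, and the in-memory scan are assumptions for illustration; in practice the vector store would apply the filter for you.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matches(meta, conditions):
    """True if every (field, op, value) condition holds for this metadata."""
    return all(f in meta and OPS[op](meta[f], value) for f, op, value in conditions)

def filtered_search(query_vec, records, conditions, top_k=3):
    # Narrow the haystack first, then rank only what survived the filter.
    candidates = [r for r in records if matches(r["meta"], conditions)]
    return sorted(candidates, key=lambda r: dot(query_vec, r["vec"]),
                  reverse=True)[:top_k]


records = [
    {"id": "r1", "meta": {"type": "invoice", "year": 2023}, "vec": [1.0, 0.0]},
    {"id": "r2", "meta": {"type": "invoice", "year": 2024}, "vec": [0.2, 0.9]},
    {"id": "r3", "meta": {"type": "memo", "year": 2025}, "vec": [1.0, 0.0]},
    {"id": "r4", "meta": {"type": "invoice", "year": 2025}, "vec": [0.9, 0.1]},
]
# type=invoice AND year>=2024
conditions = [("type", "=", "invoice"), ("year", ">=", 2024)]
hits = [r["id"] for r in filtered_search([1.0, 0.0], records, conditions)]
```

Records r1 and r3 are the closest vectors to the query, yet neither can be returned, because the filter removed them before similarity was ever computed.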

6 - If you can’t measure it, you’re guessing

This is the sheet everyone skips, and the one that separates a demo from a product.

Everything above is a knob. Without a number, turning knobs is superstition. So build a small evaluation set first: real questions paired with the passages you know should answer them. A few dozen, curated by hand, beats a thousand you don’t trust.

Measure retrieval and generation separately. They fail for different reasons and you need to know which.

Retrieval: Recall@k and MRR tell you whether the right passage even made it into the context.
Generation: faithfulness and answer relevance tell you whether the model then used it honestly.

Confusing the two sends you tuning the wrong knob for a week.
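
The retrieval half of that measurement needs very little code. In this sketch the shape of the eval set (question paired with the IDs of passages known to answer it) and the `search` callable are assumptions; the metric definitions are the standard ones.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, search, k=5):
    recalls, rrs = [], []
    for question, relevant in eval_set:
        retrieved = search(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    return {"recall@k": sum(recalls) / len(recalls), "mrr": sum(rrs) / len(rrs)}


# A hypothetical two-question eval set and a canned search function.
eval_set = [("q1", {"b"}), ("q2", {"x", "d"})]
canned = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
report = evaluate(eval_set, canned.get, k=2)
```

Run the same `evaluate` call before and after each change, and the comparison is two numbers rather than an impression.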

Now every change, whether a new embedding model, a different chunker, or a reranker, becomes a number that moves, not a vibe. Your eval procedure is the only thing that can tell you a swap improved the pipeline.

7 - When retrieval isn’t enough, teach the model

Fine-tune last, but know the lever is there.

Sometimes the passage is right there in context and the model still fumbles it, because your domain is unusual or the right answer is buried among near-misses. That’s a generation problem, and retrieval tuning won’t fix it.

RAFT, retrieval-augmented fine-tuning, addresses it directly. Each training example has three parts: a question, a context that mixes a few golden documents with distractor documents, and a chain-of-thought answer that quotes the relevant passage verbatim. The model learns to read its context, cite what matters, and ignore the noise, and it stays robust when the number of retrieved documents varies at test time.
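
Assembling one such training example could look like the sketch below. The prompt layout, field names, and number of distractors are assumptions for illustration, not the format from the RAFT paper, and the sample documents are invented.

```python
import random


def build_raft_example(question, golden_docs, distractor_pool, cot_answer,
                       n_distractors=3, rng=None):
    """Mix golden and distractor documents into one shuffled training context."""
    rng = rng or random.Random(0)
    k = min(n_distractors, len(distractor_pool))
    context = list(golden_docs) + rng.sample(distractor_pool, k=k)
    # Shuffle so the model cannot learn that the answer sits in a fixed slot.
    rng.shuffle(context)
    docs = "\n\n".join(f"[doc {i}] {d}" for i, d in enumerate(context, start=1))
    return {"prompt": f"{docs}\n\nQuestion: {question}", "completion": cot_answer}


example = build_raft_example(
    question="How long do refunds take?",
    golden_docs=["Refunds are issued within 14 days of a return."],
    distractor_pool=[
        "Invoices are sent monthly.",
        "Shipping takes 3 to 5 business days.",
        "Gift cards cannot be refunded to cash.",
        "Support is open on weekdays.",
    ],
    cot_answer='The memo states "Refunds are issued within 14 days of a return." '
               "So refunds take up to 14 days.",
)
```

Varying `n_distractors` across examples is one way to train for the changing number of retrieved documents the model will see at test time.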

It’s the last lever, not the first. Exhaust retrieval and ranking quality before you reach for it: fine-tuning is slower, costs a training loop, and trades away the thing RAG gives you for free, which is the ability to update knowledge by editing the index, not retraining the model.

The seven points to remember

Seven things, if you remember nothing else.

Treat RAG as search, not as a wrapper around an LLM. The LLM is just a part, not the system.
Own your retrieval. Abstractions over the part that decides correctness will cost you later.
Hybrid is the default. Meaning and words miss in different places, so running both covers the seam.
Spend compute at rank time, not retrieval time. Recall is cheap; precision is worth paying for.
The index is a product. Most bad answers trace back to a bad chunk, not a bad model.
Build the eval before the features. You can’t improve a number you don’t have.
Fine-tune last. Keep the freedom to fix knowledge by editing the index.

↗ Annex · the reading I’d actually keep open

Contextual Retrieval: context-aware chunks + BM25, with the numbers (hybrid search)
Sparse Vectors in Qdrant: pure-vector hybrid search (sparse · SPLADE)
MTEB Leaderboard: pick an embedding model on evidence (embeddings)
Training a cross-encoder: build your own reranker (reranking)
ranx: fast ranking evaluation, comparison & fusion (evaluation · RRF)
GritLM: generative + representational instruction tuning (embedding training)
RAFT (arXiv:2403.10131): adapting a model to domain-specific RAG (fine-tuning)
RAG fine-tuning in practice: a code-assistant walkthrough (RAFT · applied)

Drafted for builders. End of sheet.