RAG is not a model. It is a pipeline that grounds an LLM in documents it was never trained on. For a Prismatic team member asking βhow does the decision envelope seal work?β, the answer should come from the actual code and docs β not from what the model guessed. Get the pipeline right and the model stops hallucinating about your own codebase. Get it wrong and RAG is worse than no retrieval at all, because confident wrong answers are worse than βI donβt know.β
#The three primitives
- Chunk. Split each source document into passages of ~500 tokens with ~50-token overlap. Smaller chunks miss context; larger chunks dilute the relevance score.
- Embed. Turn each chunk into a vector using a stable model. Store the vector plus the source reference in a vector database.
- Retrieve. At query time, embed the query, fetch the top-k nearest chunks, and pass them to the LLM with the instruction βanswer only from the provided context.β
Thatβs it. Every βRAG frameworkβ adds features on top of these three primitives; the primitives themselves do most of the work.
#The step everyone skips: hybrid search
Pure vector search is bad at rare terms. βWhat does binary_to_term/2 do?β should return the exact function β but an embedding of the question may not put the exact function near the top. The fix is hybrid retrieval: combine BM25-style keyword search (via Meilisearch) with vector search, then rerank. You keep the semantic recall of vectors and the exact-match precision of keywords.
def retrieve(query, k \\ 8) do
keyword_hits = Meilisearch.search("docs", query, limit: k)
vector_hits = VectorStore.nearest(embed(query), limit: k)
(keyword_hits ++ vector_hits)
|> dedup_by(& &1.chunk_id)
|> Enum.sort_by(&score/1, :desc)
|> Enum.take(k)
end#Prompt: constrain the model
The prompt template matters as much as the retrieval:
Answer the question below using ONLY the numbered context passages.
If the context does not contain the answer, say "not found in the docs."
Cite passage numbers in your answer like [1], [2].
CONTEXT:
[1] <chunk>
[2] <chunk>
...
QUESTION: <query>βSay not foundβ is the load-bearing line. Without it the model invents an answer. With it, a bad retrieval fails loudly instead of silently.
#Where RAG is wrong
RAG is wrong when:
- The question requires reasoning over the entire corpus (e.g. βhow many agents are there?β). Retrieval returns 8 chunks; the model cannot count what it does not see.
- The answer changes frequently. Embeddings go stale. Either reindex aggressively or route those queries to a live system.
- The user wants authority. βCan I delete this module?β needs a live dependency analysis, not a retrieved README.
#Where to go next
- Academy: Storage Patterns β where the vector DB fits
- Glossary: RAG, LLM, Embedding, Vector Database, Meilisearch
Three primitives. One hybrid retrieval. One honest prompt. That is RAG.