RAG for Internal Knowledge
Engineering

RAG for Internal Knowledge: When the LLM Should Not Make Things Up

Retrieval-augmented generation sounds like magic. In practice it is three boring primitives β€” chunk, embed, retrieve β€” wired carefully enough that the LLM stops hallucinating about your own codebase.

Apr 09, 2026 Β· 7 min read Β· TomΓ‘Ε‘ Korcak (korczis)

RAG is not a model. It is a pipeline that grounds an LLM in documents it was never trained on. For a Prismatic team member asking β€œhow does the decision envelope seal work?”, the answer should come from the actual code and docs β€” not from what the model guessed. Get the pipeline right and the model stops hallucinating about your own codebase. Get it wrong and RAG is worse than no retrieval at all, because confident wrong answers are worse than β€œI don’t know.”

#The three primitives

  1. Chunk. Split each source document into passages of ~500 tokens with ~50-token overlap. Smaller chunks miss context; larger chunks dilute the relevance score.
  2. Embed. Turn each chunk into a vector using a stable model. Store the vector plus the source reference in a vector database.
  3. Retrieve. At query time, embed the query, fetch the top-k nearest chunks, and pass them to the LLM with the instruction β€œanswer only from the provided context.”

That’s it. Every β€œRAG framework” adds features on top of these three primitives; the primitives themselves do most of the work.

Pure vector search is bad at rare terms. β€œWhat does binary_to_term/2 do?” should return the exact function β€” but an embedding of the question may not put the exact function near the top. The fix is hybrid retrieval: combine BM25-style keyword search (via Meilisearch) with vector search, then rerank. You keep the semantic recall of vectors and the exact-match precision of keywords.

def retrieve(query, k \\ 8) do
  keyword_hits = Meilisearch.search("docs", query, limit: k)
  vector_hits  = VectorStore.nearest(embed(query), limit: k)
  (keyword_hits ++ vector_hits)
  |> dedup_by(& &1.chunk_id)
  |> Enum.sort_by(&score/1, :desc)
  |> Enum.take(k)
end

#Prompt: constrain the model

The prompt template matters as much as the retrieval:

Answer the question below using ONLY the numbered context passages.
If the context does not contain the answer, say "not found in the docs."
Cite passage numbers in your answer like [1], [2].

CONTEXT:
[1] <chunk>
[2] <chunk>
...

QUESTION: <query>

β€œSay not found” is the load-bearing line. Without it the model invents an answer. With it, a bad retrieval fails loudly instead of silently.

#Where RAG is wrong

RAG is wrong when:

  • The question requires reasoning over the entire corpus (e.g. β€œhow many agents are there?”). Retrieval returns 8 chunks; the model cannot count what it does not see.
  • The answer changes frequently. Embeddings go stale. Either reindex aggressively or route those queries to a live system.
  • The user wants authority. β€œCan I delete this module?” needs a live dependency analysis, not a retrieved README.

#Where to go next

Three primitives. One hybrid retrieval. One honest prompt. That is RAG.

Browse all β†’