Every RAG tutorial I found started the same way: pip install langchain. And then the magic happened — a few method calls, a vector store, and suddenly you had a "production RAG system." Except you didn't. You had a wrapper around a wrapper around a wrapper, and when something went wrong, you had no idea where to look.
So I built one from scratch. No LangChain. No LlamaIndex. Just Python, Sentence-Transformers, ChromaDB, and a local LLM running on my laptop. Here is what building it from the ground up taught me that no tutorial covered.
Why most RAG demos don't prepare you for production
You follow the tutorial.
It works perfectly.
Three API calls, a vector store, a retriever. The demo answers correctly. You ship it. Then it fails on real documents in ways you can't debug because you don't know what's inside the abstraction.
You build from scratch.
You break everything.
Your chunker splits a policy clause mid-sentence. Your model hallucinates on large contexts. You fix both — and now you understand every layer because you wrote every layer.
Two pipelines. Every stage visible.
── INGESTION (runs once)
── RETRIEVAL + GENERATION (every query)
── OUTPUT
There are two pipelines that run at different times. Ingestion happens once. Retrieval and generation happen on every query. Understanding that separation is the first thing frameworks hide from you. When you write it yourself, the boundary is a function call — and you know exactly what crosses it.
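That boundary can be made concrete in a few lines. This is a toy sketch, not the repo's code: the "store" is a plain list standing in for ChromaDB, and `embed` is a character-frequency stand-in for a real sentence encoder. The point is the shape — `ingest` runs once, `answer` runs per query, and a single data structure crosses the boundary.

```python
# Toy sketch of the two-pipeline boundary: ingest once, answer many times.

def embed(text: str) -> list[float]:
    # Stand-in "embedding": letter frequencies, just to make the boundary
    # concrete. A real system would call a sentence encoder here.
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

def ingest(docs: dict[str, str]) -> list[dict]:
    # Runs ONCE: chunk on blank lines, embed, store.
    store = []
    for name, text in docs.items():
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            store.append({"source": name, "chunk": i,
                          "text": para, "vec": embed(para)})
    return store

def answer(store: list[dict], question: str, k: int = 2) -> list[dict]:
    # Runs on EVERY query: embed the question, rank chunks, return top-k.
    q = embed(question)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sorted(store, key=lambda c: dot(c["vec"], q), reverse=True)[:k]

store = ingest({"policy.txt": "Refunds take 14 days.\n\nShipping is free."})
top = answer(store, "refund period")
```

Everything that crosses the function boundary is visible in `store` — which is exactly what the frameworks abstract away.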
The system — and what I left visible on purpose
The system reads plain text policy documents, chunks them into paragraphs with 2-sentence overlap, generates embeddings using all-MiniLM-L6-v2, stores them in a persistent ChromaDB instance, and answers natural language questions by retrieving the most relevant chunks and passing them to a local LLM via Ollama.
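The generation step looks roughly like this. Hedged heavily: `build_prompt` and `ask_ollama` are my illustrative names, not necessarily the repo's; the endpoint is Ollama's default local `/api/generate`, which requires a running `ollama serve`; and the `qwen2.5:7b` model tag is an assumption.

```python
import json
import urllib.request

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: numbered context blocks with their
    file and line range, then the question, with an instruction to
    answer only from the context."""
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}, lines {c['start']}-{c['end']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer using ONLY the context below. Cite sources like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def ask_ollama(prompt: str, model: str = "qwen2.5:7b") -> str:
    # Ollama's local generate endpoint; assumes `ollama serve` is running.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

chunks = [{"source": "leave_policy.txt", "start": 12, "end": 19,
           "text": "Employees accrue 1.5 days of leave per month."}]
prompt = build_prompt("How much leave do employees accrue?", chunks)
```

Because the prompt is assembled by hand, you can print it and read exactly what the model sees — no hidden template.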
I left commented logic, print statements, and intermediate variables in the code deliberately. Comments like "this is valid only for the very first para we read from the file" are in there on purpose — they show the edge cases I found while building, not polished away to look tidy.
- Paragraph-based chunking: a blank line marks the end of a semantic unit, preserving meaning that fixed-size windows destroy.
- 2-sentence overlap: the last 2 sentences of each chunk prefix the next. This helps the LLM, not the retrieval scores.
- File + line range citations: every answer includes the filename, start line, and end line. Auditable and debuggable.
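The chunker above can be sketched in stdlib Python. This is a simplified version under my own assumptions — naive sentence splitting on punctuation, my own function name — but it shows the two details the tutorials skip: carrying the overlap forward, and recording the 1-based line range so every chunk can be cited later.

```python
import re

def chunk_paragraphs(text: str, filename: str, overlap: int = 2) -> list[dict]:
    """Split on blank lines; prefix each chunk with the last `overlap`
    sentences of the previous one; record the 1-based line range."""
    chunks = []
    prev_tail = ""          # last `overlap` sentences of the previous chunk
    para_lines, start = [], 1
    for i, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not para_lines:
                start = i   # first line of a new paragraph
            para_lines.append(line)
        elif para_lines:    # blank line = end of semantic unit
            para = " ".join(para_lines)
            chunks.append({"source": filename, "start": start, "end": i - 1,
                           "text": (prev_tail + " " + para).strip()})
            # Naive sentence split; the overlap helps the LLM, not retrieval.
            sents = re.split(r"(?<=[.!?])\s+", para)
            prev_tail = " ".join(sents[-overlap:])
            para_lines = []
    if para_lines:          # trailing paragraph with no blank line after it
        para = " ".join(para_lines)
        chunks.append({"source": filename, "start": start,
                       "end": start + len(para_lines) - 1,
                       "text": (prev_tail + " " + para).strip()})
    return chunks

doc = "Refunds take 14 days. Contact support first.\n\nShipping is free worldwide."
out = chunk_paragraphs(doc, "policy.txt")
```

Running it on the two-paragraph `doc` yields two chunks; the second carries the first chunk's final sentences as its overlap prefix, while its line range still points only at its own paragraph.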
What benchmarking three local models revealed
The failure modes were different for each model. DeepSeek blended facts across chunks — subtle hallucination, confident tone. Qwen 1.5b truncated — it answered from the first chunk and ignored the rest. Qwen 7b held coherence across all 5 retrieved chunks. Same hardware, same documents, very different results.
Citations are your debugging tool — not just a trust signal
I added citations for user trust. I quickly discovered they are also the most powerful debugging instrument in the pipeline. When an answer is wrong, I know exactly where to look: open the document, read the cited lines, and ask whether the answer was actually in that chunk, whether the chunk was too large, or whether the answer was split across a boundary.
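The citation trail itself is almost trivial to produce once chunks carry their metadata — the exact output format here is my choice, not necessarily the repo's:

```python
def format_citations(chunks: list[dict]) -> str:
    """Render retrieved chunks as an auditable source list. When an
    answer is wrong, these pointers say exactly which lines to reread."""
    return "\n".join(
        f"[{i + 1}] {c['source']}, lines {c['start']}-{c['end']}"
        for i, c in enumerate(chunks)
    )

cites = format_citations([
    {"source": "leave_policy.txt", "start": 12, "end": 19},
    {"source": "refund_policy.txt", "start": 4, "end": 9},
])
print(cites)
# [1] leave_policy.txt, lines 12-19
# [2] refund_policy.txt, lines 4-9
```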
What works — and what's next
The system works. But it isn't fully production-ready, and I'm honest about that in the README. Three things are missing: an evaluation harness that scores faithfulness programmatically, ingestion idempotency so re-running doesn't create duplicate chunk IDs, and a distance threshold filter so low-confidence retrievals don't pollute the prompt.
- ✓ Paragraph chunking with 2-sentence overlap
- ✓ Citation-grounded retrieval with line tracking
- ✓ Local LLM benchmarking across 3 models
- ○ LLM-as-judge evaluation harness — next
- ○ Ingestion idempotency via content hashing — next
- ○ Distance threshold filtering — next
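Two of those gaps fit in a few lines each — sketched here under my own assumptions, not as the repo's plan. Hashing chunk content gives deterministic IDs, so re-running ingestion with an upsert-style write replaces records instead of duplicating them; a distance cutoff drops weak retrievals before they reach the prompt.

```python
import hashlib

def chunk_id(source: str, text: str) -> str:
    # Same content -> same ID, so re-ingestion can upsert instead of
    # duplicating (e.g. ChromaDB's upsert replaces records by ID).
    return hashlib.sha256(f"{source}\x00{text}".encode()).hexdigest()[:16]

def filter_by_distance(results: list[dict], max_distance: float = 0.8) -> list[dict]:
    # Drop retrievals whose distance exceeds the cutoff so low-confidence
    # matches never pollute the prompt. 0.8 is a placeholder; tune per corpus.
    return [r for r in results if r["distance"] <= max_distance]

a = chunk_id("policy.txt", "Refunds take 14 days.")
b = chunk_id("policy.txt", "Refunds take 14 days.")
kept = filter_by_distance([{"text": "good", "distance": 0.3},
                           {"text": "weak", "distance": 1.2}])
```

Identical content always hashes to the same ID (`a == b` above), and only the low-distance result survives the filter. The evaluation harness is the genuinely hard one of the three and is deliberately not sketched here.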
Full code and README on GitHub. Built to show how RAG works — not hide it.