Most RAG tutorials show you how to get an answer. Very few show you how to get a correct answer — and almost none explain what causes a wrong one. After building a multi-document RAG system from scratch, I ran into three problems no tutorial warned me about: chunking destroying meaning, quantized models hallucinating under load, and token speed making some models unusable on real hardware.
This is what I learned — and what I would do differently from day one.
Fixed-size chunking destroys meaning
The default chunking strategy in most tutorials is fixed-size — split every 512 tokens, maybe with a small overlap. It is simple, predictable, and fast. It is also wrong for documents where meaning lives inside a continuous clause.
I was chunking policy documents. A policy clause like "employees are entitled to reimbursement up to $500 provided the expense was pre-approved by a line manager and falls within the categories defined in Appendix B" — split at a fixed boundary — becomes two useless half-sentences. Neither chunk retrieves well. Neither answers the question correctly.
My solution: chunk by paragraph boundary. A blank line in the document signals the end of a semantic unit. When the chunker hits two consecutive newlines, it closes the current buffer and stores the chunk. The result is chunks that map to human-readable ideas — not arbitrary token counts.
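The chunker can be sketched in a few lines. This is a minimal illustration of the blank-line rule described above, not the original code; the function name `chunk_paragraphs` is mine.

```python
def chunk_paragraphs(text: str) -> list[str]:
    """Split text into chunks at blank lines (two consecutive newlines)."""
    chunks: list[str] = []
    buffer: list[str] = []
    for line in text.splitlines():
        if line.strip() == "":
            # A blank line closes the current semantic unit.
            if buffer:
                chunks.append("\n".join(buffer))
                buffer = []
        else:
            buffer.append(line)
    if buffer:  # flush the final paragraph
        chunks.append("\n".join(buffer))
    return chunks
```

Runs of several blank lines collapse into a single boundary, so formatting noise in the source document does not produce empty chunks.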
Overlap is for the LLM — not for retrieval
Most tutorials that mention overlap describe it as improving retrieval — the idea being that repeated content increases the chance a chunk matches a query embedding. That is not quite right. Overlap does not meaningfully change cosine similarity scores. What it actually does is help the LLM.
When your retriever returns chunk 7, that chunk starts cold. The LLM has no idea what was in chunk 6. If a clause spans a paragraph boundary — condition in paragraph 6, consequence in paragraph 7 — the LLM reading only chunk 7 will generate an incomplete or wrong answer.
Without overlap, chunk 7 starts cold: the LLM misses the condition set in paragraph 6, and the answer comes back incomplete or fabricated from training data. With overlap, the last 2 sentences of paragraph 6 prefix chunk 7: the LLM has continuity, and the answer is grounded and complete.
I added the last 2 sentences of each paragraph as a prefix to the next chunk. This is not retrieval improvement — it is generation improvement. Keep that distinction clear, because it changes how you would measure whether it is working.
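One way to implement that prefix, assuming chunks are already paragraph-sized. The naive regex sentence split is an assumption on my part; real documents with abbreviations or decimals may need a proper sentence tokenizer.

```python
import re


def add_overlap(chunks: list[str], n_sentences: int = 2) -> list[str]:
    """Prefix each chunk with the last n sentences of the previous chunk."""
    if not chunks:
        return []
    out = [chunks[0]]  # the first chunk has no predecessor to borrow from
    for prev, cur in zip(chunks, chunks[1:]):
        # Naive split: break after ., !, or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", prev.strip()) if s]
        prefix = " ".join(sentences[-n_sentences:])
        out.append(prefix + "\n" + cur)
    return out
```

Because this runs after retrieval indexing, not before, the overlapped text never touches the embeddings, which is exactly the generation-not-retrieval distinction above.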
Quantized models hallucinate under large top-K
I tested three local models: DeepSeek 4b, Qwen 1.5b, and Qwen 7b. On small top-K contexts, all three performed reasonably. The problems emerged when I increased top-K to 5, passing more retrieved context to the model.
Smaller quantized models started hallucinating — not dramatically, but in the subtle way that is hardest to catch: confident answers that almost matched the policy document but introduced a number or condition that was not there. The model could not hold the full context and started filling gaps from its training data instead of the retrieved chunks.
DeepSeek with its reasoning mode was particularly slow — 3 to 5 tokens per minute on CPU-only hardware. For a single query that might mean waiting 40 to 60 minutes for an answer. That is not a performance problem — it is a product-ending problem. Qwen 7b at 11 to 12 tokens per minute was the only model that balanced speed and coherence at top-K of 5.
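Throughput like this is worth measuring before committing to a model. A minimal harness, assuming your model client is any callable that returns generated tokens; `tokens_per_minute` and the callable's shape are illustrative, not a specific library's API.

```python
import time


def tokens_per_minute(generate, prompt: str) -> float:
    """Time one generation call and return token throughput per minute.

    `generate` is any callable returning a sequence of tokens; swap in
    your local model client (llama.cpp, Ollama, etc.) here.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / (elapsed / 60)
```

Running it against each candidate model with a representative top-K-of-5 prompt gives you the real number for your hardware, which is the only number that matters.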
Citations as your debugging loop
Every answer in my system includes a citation: the filename, the start line, and the end line of every chunk that contributed to the response. I added this for user trust. What I quickly realized is that citations are also the most powerful debugging tool in the entire pipeline.
If the citation points to lines 42–67 and the answer is wrong, I know exactly where to look. Open the document, read those lines, ask: was the answer actually in that chunk? Was it too large? Was the answer split across a boundary?
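The metadata that makes this loop possible is small. A sketch of what each chunk carries, with hypothetical names (`Chunk`, `format_citation`); the point is that filename and line range travel with the text from ingestion to answer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk plus the provenance needed to cite it."""
    text: str
    filename: str
    start_line: int
    end_line: int


def format_citation(chunk: Chunk) -> str:
    """Render a citation like [policy.md:42-67] for the answer footer."""
    return f"[{chunk.filename}:{chunk.start_line}-{chunk.end_line}]"
```

When an answer is wrong, the citation string is the exact address of the chunk to inspect, so the debugging question becomes concrete: was the answer in those lines or not?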
Full code on GitHub — with comments left in on purpose, so you can see the thinking.