I did not set out to run a benchmarking study. I was building a RAG system for policy documents, running it locally on a CPU-only machine with 8GB RAM, and things kept failing in ways that courses and tutorials never warned me about. Each failure taught me something. This article is those lessons — in the order I learned them, the hard way.
Same hardware. Same documents. Three very different results.
Three models running locally via Ollama: DeepSeek 4b with reasoning mode, Qwen 1.5b, and Qwen 7b. One ChromaDB vector store with paragraph-chunked policy documents. Top-K set to 5 — meaning every query retrieved 5 paragraph chunks and passed them all to the model as context.
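The retrieval step can be sketched roughly like this. This is a minimal illustration, not the actual project code: the chunk fields, prompt wording, and function names are my assumptions.

```python
# Sketch of the top-K retrieval step: take the 5 retrieved paragraph
# chunks and assemble them into one context block for the local model.
# Field names ("source", "text") and prompt wording are illustrative.

TOP_K = 5  # every query retrieves 5 paragraph chunks

def build_prompt(question, retrieved_chunks):
    """Concatenate the top-K retrieved chunks into a context block
    and wrap them around the user's question."""
    context = "\n\n".join(
        f"[{c['source']}] {c['text']}" for c in retrieved_chunks[:TOP_K]
    )
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in chunks, as a ChromaDB query might return them:
chunks = [
    {"source": "leave_policy.pdf p.3",
     "text": "Employees accrue 1.5 days of leave per month."},
    {"source": "leave_policy.pdf p.4",
     "text": "Unused leave lapses after 24 months."},
]
prompt = build_prompt("How fast does leave accrue?", chunks)
print(prompt)
```

The resulting prompt string is what gets handed to the model; every experiment below varies only the model on the other side of it.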
I was not trying to find the best model for general use. I was trying to find the best model for this specific task: answering questions about policy documents, on constrained hardware, with real retrieved context — not a toy prompt.
── MODELS TESTED
DeepSeek 4b: impressive in demos, unusable on CPU.
DeepSeek's reasoning mode is compelling — you can watch the chain of thought unroll before the answer arrives. On a GPU, that is a feature. On a CPU with 8GB RAM, it is a liability.
Average throughput: 3–5 tokens per minute. A 200-token answer took anywhere from 40 minutes to over an hour. No user will wait an hour for a policy answer. That is not a performance problem — it is a product-ending problem.
But speed was not the worst of it. Under top-K of 5, DeepSeek's coherence degraded. It would blend facts from different chunks — sometimes correctly, sometimes not — producing answers that sounded right but introduced numbers or conditions that were not in any of the retrieved documents. Subtle hallucination. The hardest kind to catch.
Qwen 1.5b: faster throughput, different failure mode.
Qwen 1.5b was faster — 7 to 9 tokens per minute — and for small contexts it performed reasonably. The problem emerged specifically with top-K of 5. Passing 5 paragraph chunks to a 1.5b parameter model is asking it to reason across more context than it can reliably hold.
The failure mode was different from DeepSeek. Qwen 1.5b did not blend facts — it truncated. It answered based on the first 1 or 2 chunks and ignored the rest, even when the most relevant information was in chunk 4 or 5. The retrieved context window exceeded the model's effective reasoning capacity, and it defaulted to what it saw first.
Qwen 7b: 11–12 tokens/min, and coherence held.
Qwen 7b ran at 11 to 12 tokens per minute on the same CPU-only hardware. A 200-token answer in roughly 17 minutes — slow by API standards, but workable for a local system where privacy matters more than speed.
More importantly, coherence held under full top-K context. The model could reason across all 5 retrieved chunks, synthesize information from multiple paragraphs, and produce answers that stayed grounded in the retrieved content. Hallucination dropped dramatically compared to the smaller models.
Why top-K matters more than people think
Top-K is the number of retrieved chunks passed to the LLM as context. Higher top-K means more retrieved information — which sounds better. But there is a model capacity ceiling. Below it, more context improves answers. Above it, the model starts losing coherence, dropping chunks, or blending facts incorrectly.
The right top-K is not a universal constant. It is a function of your model's capacity, your chunk size, and your document type. For my system — paragraph chunks averaging 150–200 words, policy documents, Qwen 7b — top-K of 5 was the practical ceiling before quality started degrading.
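A quick back-of-envelope check makes the ceiling concrete. The words-per-token ratio below is a rough rule of thumb, not a measured value for these models:

```python
# Estimate how many tokens top_k paragraph chunks will occupy, so you
# can compare the context load against a model's effective capacity.
# The 0.75 words-per-token ratio is an assumed rule of thumb.

def context_tokens(top_k, words_per_chunk, words_per_token=0.75):
    """Rough token count for top_k retrieved chunks."""
    return int(top_k * words_per_chunk / words_per_token)

# Paragraph chunks averaging ~175 words, top-K of 5, as in this setup:
print(context_tokens(5, 175))  # 1166
```

Five chunks is over a thousand tokens of retrieved context before the question and system prompt are even counted — comfortably within a 7b model's reach, but more than a 1.5b model reliably reasons over.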
The question no benchmark answers: what does a wrong answer look like?
Token speed and coherence measurements are useful. But the most important experiment I ran was not quantitative. I manually read 20 answers, traced each one back to its citation, opened the source document at the cited lines, and asked: did the model stay inside the retrieved context, or did it bring in something from outside?
Smaller models failed this test more often. Not always — sometimes they were perfectly accurate on simple queries. But on complex multi-clause policies with conditions and exceptions, they would occasionally introduce a number or qualifier that was not in the cited lines. Without citations, you would never know. You would just trust the confident-sounding answer.
Three decisions I'd make from day one
Start with 7b minimum
Do not optimise for speed before you know your model can hold the context load. Start with a model that can handle your top-K reliably.
Citations from line one
Not as a feature — as infrastructure. Every chunk that touches the LLM should carry its source metadata from the very beginning.
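One way to make that concrete is to attach source metadata at ingestion time, so it is still present when the answer is assembled. This is a sketch under my own assumptions; the field names and citation format are illustrative, not the project's actual schema:

```python
# Citations as infrastructure: every chunk carries its source document
# and line range from the moment it is created, so any answer can be
# traced back to the exact lines it was grounded in.

from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    doc: str         # source file name
    start_line: int  # first line of the paragraph in the source
    end_line: int    # last line

    def citation(self) -> str:
        return f"{self.doc}:{self.start_line}-{self.end_line}"

c = Chunk("Unused leave lapses after 24 months.",
          "leave_policy.txt", 88, 91)
print(c.citation())  # leave_policy.txt:88-91
```

With ChromaDB, the same information would live in each chunk's metadata dict so it comes back alongside the text on every query.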
Measure token speed early
On local hardware, the difference between 5 and 12 tokens/min is the difference between a usable tool and an unusable one.
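Surfacing that number takes one line of arithmetic from your first test query. A trivial sketch, using the Qwen 7b figures from above:

```python
# Tokens-per-minute from a single timed generation. In practice you
# would time a real query; the numbers below are the article's
# measured Qwen 7b case (a 200-token answer in roughly 17 minutes).

def tokens_per_minute(n_tokens, elapsed_seconds):
    """Throughput in tokens/min for one generation."""
    return n_tokens * 60.0 / elapsed_seconds

print(round(tokens_per_minute(200, 17 * 60), 1))  # 11.8
```

Run this against your first real query before building anything else; it tells you immediately which side of the usable/unusable line your model sits on.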
Code on GitHub · Fluent-Graph on Blog · Built by Eshwar Prasad Yaddanapudi