Node.js everyday

AI FluentGraph — Semantic Intelligence Layer for ServiceNow Fluent & LLM Systems

February 15, 2026 · YES PRASAD

Open Source · MIT License · ServiceNow Fluent SDK 4.x.x

Map every table, script, and relationship across your Fluent application. Know the full blast radius of any change — before it reaches production.

FluentGraph is a semantic layer that enables LLMs to safely understand, reason about, and interact with ServiceNow Fluent systems using structured context instead of raw code.


▶ MCP compatible ⏱ One-command analysis ⚙ Zero configuration ▶ CI/CD compatible

fluent-graph analyze — ITSM Application

$ fluent-graph analyze
── Scanning project artifacts...

📦 DATA SCHEMA ─────────────────────────────────────
Table                        Source File            Status
x_1566_seventh_account       tables/account.ts      ✓ active
x_1566_seventh_customer      tables/customer.ts     ✓ active
x_1566_seventh_risk_score    tables/risk_score.ts   ✓ active
x_1566_seventh_fraud_alert   tables/fraud.ts        ✓ active

🔗 SCHEMA RELATIONSHIPS ────────────────────────────
Source Field            →  Target Table              Type
account.owner           →  x_1566_seventh_customer   reference
fraud_alert.account     →  x_1566_seventh_account    reference
critical_account.team   →  sys_user_group            reference

⚙️ LOGIC ATTACHMENTS ──────────────────────────────
Artifact                      Table            Trigger      State
bank_incident_onload          incident         [onLoad]     active
cs_incident_onload            incident         [onLoad]     active
cs_incident_onchange_caller   incident         [onChange]   active
cs_account_onchange_type      x_1566_account   [onChange]   active
Auto-close empty Changes      change_request   [action]     review

✓ Analysis complete — 9 tables · 8 relationships · 9 scripts
Output written to fluent-graph.json

The Problem

Production breaks are preventable

ServiceNow Fluent applications grow quickly. What begins as a handful of clean tables becomes a dense network of scripts, business rules, and cross-references. Without visibility into those connections, every refactor is a gamble.

Scenario A

You delete a field.
Production breaks.

Fourteen client scripts fail. Three business rules throw errors. Two dashboards go blank. You discover the full scope — at runtime, in production.

Error: Field 'caller_id' not found
  at ClientScript: cs_incident_onload
  at ClientScript: bank_incident_onload
  + 12 more...

Scenario B

A new developer joins.
No one knows the app.

Understanding how tables interconnect, which scripts are active, and why certain business rules exist takes days of exploration — time that should be spent shipping features.

Root Cause

ServiceNow Fluent needs a dependency graph

The most critical question before any change: what else does this touch? FluentGraph is that answer — a free, one-command tool that draws a complete map of your project.

Capabilities

Everything your team needs
to ship with confidence

FluentGraph surfaces the architectural intelligence already embedded in your project — structured, queryable, and ready for automated pipelines.

Artifact Lineage

A complete inventory of every table, field, client script, business rule, UI action, and ACL — each mapped to its source file and active state.

Schema Relationships

Every foreign-key reference and table extension within your app, including links to platform tables like sys_user and incident.

Blast Radius Analysis

Run a single command to enumerate every artifact affected if you modify or remove a target table. Know the full impact before the first line changes.

CI/CD Integration

Outputs a machine-readable fluent-graph.json artifact purpose-built for automated pipeline checks. Intercept conflicts before the merge queue.

Logic Attachment Mapping

Associates every onLoad, onChange, and onSubmit script with its owning table, trigger type, and active state — making dormant code visible at a glance.

Rapid Onboarding

New engineers understand the full application architecture in minutes rather than days. The entire dependency graph is a single command away.

Output

Three sections.
Complete visibility.

Run fluent-graph analyze and get a clean, readable report divided into three sections — plus a machine-readable JSON file for your pipelines.

01 — DATA SCHEMA

Your table inventory

A clean list of every custom table alongside its source file and active status.

Table name and scope
Source file location
Active / deleted status
Visibility attributes

02 — SCHEMA RELATIONSHIPS

How your tables talk

Every reference field connection — including links out to platform tables outside your scope.

Source table and field name
Target table (local or platform)
Reference type and direction
Cross-scope dependencies

03 — LOGIC ATTACHMENTS

Every script mapped

Each client script, business rule, and UI action listed with the table it attaches to and its trigger type.

onLoad / onChange / onSubmit
Business rules and script includes
UI actions and ACLs
Active vs inactive state

Blast Radius

Know what breaks
before you deploy

Before renaming, modifying, or removing any table, run one additional command. FluentGraph traverses your entire dependency graph and returns a ranked list of every artifact affected — along with a calculated risk level.

$ fluent-graph blast incident

🔥 Blast Radius — incident LOW RISK

Artifact	Type	Trigger
bank_incident_onload	client_script	onLoad
cs_incident_onload	client_script	onLoad
cs_incident_onchange_caller	client_script	onChange

Total impacted: 3 artifacts · client_script ×3

🔥 Blast Radius — x_1566039_seventh_account MEDIUM RISK

Artifact	Type	Reason
cs_account_onchange_type	client_script	onChange
x_1566_seventh_fraud_alert	field	FK via account
x_1566_seventh_risk_score	field	FK via account
x_1566_seventh_account_ext	field	FK via linked_account

Total impacted: 4 artifacts · client_script ×1 · field ×3
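To make the traversal idea concrete, here is a minimal sketch of how a blast radius could be computed from a list of dependencies. The `Edge` shape, the `blastRadius` function, and the sample edges are assumptions made for illustration; they are not FluentGraph's internal data model or the schema of fluent-graph.json.

```typescript
// Hypothetical edge: `from` depends on `to` (a script attached to a table,
// or a reference field pointing at a table). Not FluentGraph's real schema.
interface Edge {
  from: string;
  to: string;
  kind: "client_script" | "field";
}

// Breadth-first walk over reverse dependencies: everything that directly or
// transitively depends on `target` is treated as impacted.
function blastRadius(edges: Edge[], target: string): Edge[] {
  const dependents = new Map<string, Edge[]>();
  for (const edge of edges) {
    const list = dependents.get(edge.to) ?? [];
    list.push(edge);
    dependents.set(edge.to, list);
  }

  const impacted: Edge[] = [];
  const seen = new Set<string>([target]);
  const queue: string[] = [target];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const edge of dependents.get(current) ?? []) {
      if (seen.has(edge.from)) continue; // already counted
      seen.add(edge.from);
      impacted.push(edge);
      queue.push(edge.from);
    }
  }
  return impacted;
}

// Sample edges loosely based on the report above (illustrative only).
const sampleEdges: Edge[] = [
  { from: "bank_incident_onload", to: "incident", kind: "client_script" },
  { from: "cs_incident_onload", to: "incident", kind: "client_script" },
  { from: "cs_incident_onchange_caller", to: "incident", kind: "client_script" },
  { from: "cs_account_onchange_type", to: "x_1566_seventh_account", kind: "client_script" },
  { from: "x_1566_seventh_fraud_alert", to: "x_1566_seventh_account", kind: "field" },
  { from: "x_1566_seventh_risk_score", to: "x_1566_seventh_account", kind: "field" },
];

const impactedByIncident = blastRadius(sampleEdges, "incident");
console.log(impactedByIncident.map((e) => `${e.from} (${e.kind})`));
```

A risk level could then be derived from the size and mix of the returned list, but how FluentGraph actually ranks risk is not described here.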

"Running blast radius before a major refactor revealed that deleting a custom field would have broken four client scripts, two business rules, and a UI action — all before a single line was deployed to production." — Eshwar Prasad Yaddanapudi, Creator of FluentGraph

Get Started

Up and running
in under a minute

FluentGraph requires Node.js 18 or later and an initialized ServiceNow Fluent SDK project. No account, no configuration file, no API keys.

Install globally via npm

Install the package once and use it across all your Fluent projects. Alternatively, invoke it on demand with npx, without a global install.

$ npm install -g @yesprasad/fluent-graph

Analyze your Fluent project

Navigate to the root of any Fluent project and run the analysis command. FluentGraph scans your artifact directory and produces a full dependency report in seconds.

$ fluent-graph analyze

Run blast radius before any breaking change

Before modifying any table, supply its name as a target. FluentGraph returns the complete list of affected artifacts and a risk-level assessment.

$ fluent-graph blast <table_name>

Wire into your CI/CD pipeline

The generated fluent-graph.json integrates with any pipeline. Add the step to your GitHub Actions workflow to block merges that introduce identity conflicts.

CI/CD Integration

Catch conflicts
before the merge

FluentGraph outputs structured JSON designed for automated pipeline consumption. Gate your deployments on dependency integrity without writing a single custom script.

.github/workflows/fluent-check.yml

name: Fluent Dependency Check
on: [push, pull_request]

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the analyzer can see the project files
      - uses: actions/checkout@v4

      # Extract full artifact lineage
      - name: Run FluentGraph
        run: npx @yesprasad/fluent-graph analyze

      # Block merge on identity conflicts
      - name: Validate dependency graph
        run: |
          if grep -q '"action": "CONFLICT"' fluent-graph.json; then
            echo "Identity conflicts detected"
            exit 1
          fi

✓ Runs in GitHub Actions, GitLab CI, or Jenkins
✓ No authentication or external network calls required
✓ Structured JSON output for custom pipeline logic
✓ Automatically blocks deploys with unresolved identity conflicts
✓ Integrates with existing Fluent SDK toolchain workflows
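For custom pipeline logic, a small script can parse the JSON instead of grepping it. The sketch below is hedged: the only detail taken from the workflow above is that a conflict shows up as an `action` property with the value `CONFLICT`. The rest of the file's structure is not documented here, so the hypothetical `countConflicts` helper walks the parsed value generically, and the sample input is invented.

```typescript
// Recursively count objects that carry `"action": "CONFLICT"` anywhere in a
// parsed JSON value. Makes no assumption about the surrounding schema.
function countConflicts(value: unknown): number {
  if (Array.isArray(value)) {
    return value.reduce<number>((sum, item) => sum + countConflicts(item), 0);
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const self = record["action"] === "CONFLICT" ? 1 : 0;
    return Object.values(record).reduce<number>(
      (sum, child) => sum + countConflicts(child),
      self
    );
  }
  return 0; // strings, numbers, booleans, null
}

// Illustrative input only; the real fluent-graph.json will look different.
const sampleReport: unknown = JSON.parse(
  '{"items":[{"name":"a","action":"CONFLICT"},{"name":"b","action":"OK"}]}'
);
console.log(`conflicts: ${countConflicts(sampleReport)}`);
```

In a pipeline step, the same function could be fed the contents of fluent-graph.json and the process exit code set to 1 whenever the count is above zero.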

Open Source · Free Forever

Map your application.
Ship with certainty.

Drop FluentGraph into any ServiceNow Fluent project and get your complete dependency graph in under sixty seconds.


TypeScript Compiler API: Reading an Object's Key–Value Pairs Using "ts-morph"

October 15, 2025 · YES PRASAD

Recently, I have been trying to build a Domain-Specific Language (DSL) that sits inside a TypeScript file (*.ts).

My aim is to read this file and convert it into JSON, then into JavaScript, and finally into XML.

To do this, we need to traverse the code inside the .ts file that holds the DSL. This can be done either directly with the TypeScript Compiler API, which is more verbose and laborious, or through a package called ts-morph, which wraps those APIs in a more convenient interface.

I chose ts-morph for my use case.

Here is a sample from my DSL. It assigns to a constant the result of a function call that takes a plain JavaScript object.


export const server_script = ServerScript({
  name: 'script',
  metadata: {
    origin: 'external_api',
    module: 'customer',
    createdBy: 'Admin'
  }
});

To parse this, we hand the file to ts-morph, which creates an AST (Abstract Syntax Tree). The tree then has to be traversed recursively to extract the required information.

If we use the AST Explorer tool, we see a tree like this:


SourceFile
  └─ VariableStatement
       └─ VariableDeclarationList
            └─ VariableDeclaration (name: "server_script")
                 └─ Initializer → CallExpression
                      ├─ Expression → Identifier ("ServerScript")
                      └─ Arguments
                           └─ ObjectLiteralExpression
                                ├─ PropertyAssignment (key: "name")
                                │   └─ Initializer → StringLiteral ("script")
                                └─ PropertyAssignment (key: "metadata")
                                     └─ Initializer → ObjectLiteralExpression
                                          ├─ PropertyAssignment (key: "origin", value: "external_api")
                                          ├─ PropertyAssignment (key: "module", value: "customer")
                                          └─ PropertyAssignment (key: "createdBy", value: "Admin")

As the tree shows, each key–value pair (for example, key name with value script) is a PropertyAssignment, and the key and the value are read in different ways. Let’s see how.

To do this, we have to narrow the node we are parsing to an ObjectLiteralExpression (SyntaxKind.ObjectLiteralExpression in ts-morph), so that the compiler knows which methods are available on it.

Here’s the step-by-step extraction process. First, we find all call expressions whose callee is ServerScript:


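import { Project, SyntaxKind } from 'ts-morph';
const sourceFile = new Project().addSourceFileAtPath('server_script.ts'); // placeholder path

// Collect every call whose callee is the identifier ServerScript
const scriptExpressions = sourceFile
  .getDescendantsOfKind(SyntaxKind.CallExpression)
  .filter(call => call.getExpression().getText() === 'ServerScript');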

The result is of type CallExpression<ts.CallExpression>[], which is simply an array of call-expression nodes.

We now need to iterate over this array. For now, let us take only the first element. The code looks like this:


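// Take the first matching call for now; in practice, loop over scriptExpressions
const callExpr = scriptExpressions[0];

// The first argument of ServerScript(...) is the object literal
const objectLit = callExpr.getArguments()[0].asKindOrThrow(SyntaxKind.ObjectLiteralExpression);
const currentProps = objectLit.getProperties();

currentProps.forEach((attribute) => {
  const currAttribute = attribute.asKindOrThrow(SyntaxKind.PropertyAssignment);
  const key = currAttribute.getName();
  // getText() returns the raw source text, so a string keeps its quotes
  const value = currAttribute.getInitializer()?.getText();
  console.log(key, value);
});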

As we can see above, the key is easy to read: it is a simple .getName() call. To read the value, we first call .getInitializer() to get the node that holds the complete value, then check its kind, and only then read the content. In our case the initializer of name is a simple StringLiteral, so currAttribute.getInitializer()?.getText() is enough. Note that getText() returns the source text including the quotes ('script'); to get the unquoted string, narrow the initializer to a StringLiteral and call getLiteralValue().

But then why does ts-morph want us to do this extra work to read the value, instead of offering a simple .getValue()?

This is because the value can be another object literal, a CallExpression, or any other expression. ts-morph therefore wants us to be explicit about what it is: .getInitializer() returns the node, and we check its kind before extracting the content.
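Because the value can itself be an object literal, the extraction is naturally recursive. The following is a minimal sketch of that idea, not the converter from my DSL project: the `toJson` helper, the `Json` type, and the in-memory file name `sample.ts` are made up for illustration, and only object literals, string literals, and numeric literals are handled. Everything else falls back to raw source text.

```typescript
import { Node, Project, SyntaxKind } from "ts-morph";

type Json = string | number | { [key: string]: Json };

// Convert an initializer node into a plain value, recursing into nested objects.
function toJson(node: Node): Json {
  if (Node.isObjectLiteralExpression(node)) {
    const result: { [key: string]: Json } = {};
    for (const prop of node.getProperties()) {
      // Only `key: value` pairs are handled; shorthand and spread entries throw.
      const assignment = prop.asKindOrThrow(SyntaxKind.PropertyAssignment);
      result[assignment.getName()] = toJson(assignment.getInitializerOrThrow());
    }
    return result;
  }
  if (Node.isStringLiteral(node)) return node.getLiteralValue(); // no quotes
  if (Node.isNumericLiteral(node)) return node.getLiteralValue();
  // Calls, identifiers, and other expressions are kept as source text.
  return node.getText();
}

// Parse the sample from this post in memory, without touching the disk.
const project = new Project({ useInMemoryFileSystem: true });
const sourceFile = project.createSourceFile(
  "sample.ts", // hypothetical file name
  `export const server_script = ServerScript({
     name: 'script',
     metadata: { origin: 'external_api', module: 'customer', createdBy: 'Admin' }
   });`
);

const firstCall = sourceFile
  .getDescendantsOfKind(SyntaxKind.CallExpression)
  .filter((call) => call.getExpression().getText() === "ServerScript")[0];

const parsed = toJson(firstCall.getArguments()[0]);
console.log(JSON.stringify(parsed, null, 2));
```

The explicit kind checks are the point of the design: each branch decides how one kind of initializer is read, which is exactly the work a generic .getValue() could not do for us.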

ts-morph is an npm package that wraps the TypeScript compiler API to make syntax tree manipulations much simpler to perform using TypeScript itself.

LLM From RAG to Agentic RAG — Why I Built It, Whether It's Real, and What Comes Next

September 19, 2025 · YES PRASAD

AI Agent · Agentic RAG · ReAct · Insurance · Wellness · Architecture

I have a confession: I forgot why I built this.

I finished my RAG series — three posts on chunking, hallucination, and token speed — and then kept coding. The repo grew a new folder called agent/. Files appeared: perceive.py, reason.py, act.py. I shipped it, wrote a README, and moved on. A few weeks later someone asked me why I added the agent layer at all, and I genuinely did not have a crisp answer.

So I went back and reconstructed it. This article is what I found — and more usefully, when a RAG system actually needs to become an agent, with real evidence from companies that have already made that call.

The Honest Reconstruction

Why I actually built this

It started with an ER visit. The bill was itemized in ways I did not understand, which sent me down a rabbit hole about how medical charges are actually structured. I built a RAG system to query policy documents — wellness plans, benefits handbooks, coverage tables — and get plain-language answers with citations.

That worked. But then a different question formed: what if an employee did not just want to read their benefits — what if they wanted to act on them? Check eligibility before a procedure. Compare two plan options side by side. Ask a follow-up that depended on the previous answer. These are not retrieval problems. They are reasoning problems that happen to need retrieval inside them.

"A RAG system answers. An agent pursues a goal. The moment your user's question requires more than one retrieval step to answer honestly, you are already in agent territory."

That is the line I crossed — probably without fully realising it. The wellness agent I built is designed for employees and HR teams at companies whose wellness benefits are administered through a provider. Simple queries like "what is the maximum gym claim?" are retrieval. But "compare my nutrition and mental health coverage and tell me which one I am underusing" — that requires intent classification, two separate retrievals, a synthesis step, and a structured answer. That is an agent workflow.

RAG vs Agent

The actual difference — not the theoretical one

Traditional RAG

One question. One retrieval. One answer.

Query comes in. Embed it. Fetch top-K chunks. Pass to LLM. Done. The flow is linear, static, and non-iterative. It cannot decide to look again.

Agentic RAG

A goal. Multiple steps. Adaptive retrieval.

The LLM plans, retrieves, observes the result, and decides: enough context, or retrieve again? It can call tools, cross-reference sources, and change strategy mid-run.

The practical difference shows up in three scenarios. First, when a query has multiple parts and each part needs its own retrieval. Second, when the answer to the first question determines what to ask next. Third, when you need the system to check its own answer — to retrieve a second source and verify before responding. Standard RAG cannot do any of these. An agent can.

The Signal

When does RAG need to become an agent?

Not always. This is important. Most RAG problems do not need an agent — adding one adds latency, cost, and complexity. The question is whether your use case hits the signals that justify it.

Multi-step queries

The user's question cannot be answered from a single retrieval. It requires sub-questions with intermediate results feeding forward.

Multiple data sources

The answer lives across different documents, APIs, or structured databases — and the agent must decide which to query and in what order.

Verification required

High-stakes answers (benefits, healthcare, compliance) need the system to cross-check its own response before surfacing it to the user.

Conversational continuity

The user's follow-up question depends on the previous answer. Without memory, each turn starts cold and loses context.

The Wellness AI Agent hits three of these. Benefits queries often span multiple plan documents. Comparisons require synthesis across sources. And in healthcare-adjacent domains, a wrong answer has real consequences, so verification is not optional.

Real World

Companies already running this in production

This is not experimental. Agentic RAG is in production at scale across industries — and the use cases closest to mine are the most instructive.

Production deployments · Agentic RAG

Uber · Genie

Their on-call engineering copilot uses agents at multiple stages: a Query Optimizer reformulates ambiguous questions, a Source Identifier narrows the document set, and a Post-Processor deduplicates retrieved context. Result: 27% more acceptable answers, 60% reduction in incorrect advice vs traditional RAG.

eBay · Mercury

Their internal agent platform combines RAG with real-time inventory data for product recommendations. The agent decides when to retrieve and what source to use — product docs, live listings, or user history — depending on the query context.

Healthcare RAG

Agentic RAG systems pulling patient history, clinical literature, and drug interaction databases improved radiology QA accuracy from 68% to 73%. The agent retrieves from verified sources and decomposes complex clinical questions before answering.

Enterprise HR / Benefits

Employee policy copilots answering HR, travel, and benefits queries with citations and effective dates are explicitly named as a production RAG use case in 2025 enterprise guides — exactly the domain of the Wellness Agent.

The Architecture

What I built — and what the ReAct loop looks like

The project structure separates concerns into two layers. The core/ folder handles the machinery: chunking, embedding, retrieval, LLM calls. The agent/ folder handles the reasoning loop: perception, planning, action. Separating these was the most important architectural decision I made — it means the agent layer can evolve without rewriting the retrieval layer.

Wellness AI Agent — Project Structure

── agent/ (the reasoning loop)
perceive.py → clean input, classify intent
reason.py → plan steps, assemble retrieval queries
act.py → call core/, generate answer, observe result

── core/ (the retrieval machinery)
chunker.py → paragraph-based, 2-sentence overlap
embedder.py → all-MiniLM-L6-v2 embeddings
retriever.py → ChromaDB vector search, top-K cosine
llm_interface.py → Ollama / local model calls

── documents/
wellness_plan.md → the benefit document corpus

The pattern the agent follows is close to ReAct — Reasoning + Acting — where the model interleaves thought steps with retrieval actions. Perceive observes the input. Reason plans the retrieval steps. Act executes them and observes the result. The difference between this and full ReAct is the feedback loop: in true ReAct, the observation from Act feeds back into Reason so the model can decide whether to retrieve again or stop. That loop is the next thing I am adding.

Current state

Agentic RAG pipeline — linear

Perceive → Reason → Act runs once per query. The agent classifies intent, retrieves relevant chunks from the wellness plan documents, synthesizes an answer with citations. Single pass, no iteration.

Works well for: single-intent queries such as "what is covered for mental health?", where one retrieval returns enough context to answer fully.

Next state

Full ReAct loop — iterative

Act feeds its observation back into Reason. The LLM decides: "I have enough context" → deliver answer, or "I need to retrieve from the nutrition section too" → loop again. Maximum 3–4 iterations before forcing a final answer.

Needed for: Multi-part queries — "compare my gym and nutrition coverage and flag anything expiring this year" — where a single retrieval does not surface enough.
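The planned loop can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the agent itself lives in Python files, while this sketch uses TypeScript to match the other code on this blog. The `Retrieve` and `Decide` types, `reactLoop`, and both stubs are hypothetical stand-ins for the retriever and the LLM.

```typescript
// Hypothetical stand-ins for the real retriever and the LLM's stop/continue call.
type Retrieve = (query: string) => string[];
type Decide = (goal: string, context: string[]) => { done: boolean; nextQuery?: string };

interface RunResult {
  context: string[];
  iterations: number;
  forced: boolean; // true when the iteration cap ended the loop
}

function reactLoop(goal: string, retrieve: Retrieve, decide: Decide, maxIterations = 3): RunResult {
  const context: string[] = [];
  let query = goal;
  for (let i = 1; i <= maxIterations; i++) {
    context.push(...retrieve(query));       // Act: run the retrieval
    const decision = decide(goal, context); // Reason: inspect the observation
    if (decision.done || decision.nextQuery === undefined) {
      return { context, iterations: i, forced: false };
    }
    query = decision.nextQuery;             // loop again with a refined query
  }
  // Cap reached: force a final answer from whatever context was gathered.
  return { context, iterations: maxIterations, forced: true };
}

// Placeholder corpus; each retrieval surfaces only one section on purpose.
const sections: Record<string, string> = {
  gym: "Gym section text (placeholder)",
  nutrition: "Nutrition section text (placeholder)",
};
const retrieveStub: Retrieve = (query) =>
  Object.entries(sections)
    .filter(([name]) => query.includes(name))
    .slice(0, 1)
    .map(([, text]) => text);

// Stub decision: keep going until the nutrition section is in context.
const decideStub: Decide = (_goal, context) =>
  context.some((chunk) => chunk.startsWith("Nutrition"))
    ? { done: true }
    : { done: false, nextQuery: "nutrition" };

const run = reactLoop("compare my gym and nutrition coverage", retrieveStub, decideStub);
console.log(run.iterations, run.forced, run.context.length);
```

With `maxIterations` set to 1 the same function behaves like the current single-pass pipeline, which makes the linear version a special case of the loop rather than a separate code path.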

What This Means

The phrase I keep coming back to: reactive to predictive

When I first wrote about AI agents, I included a phrase without fully unpacking it: reactive to predictive. Traditional software — and traditional RAG — is reactive. Something happens, the system responds. An agent is predictive: it has a goal, it plans steps toward that goal, and it adjusts when reality does not match the plan.

For wellness benefits, that shift matters more than it might seem. A reactive system answers what you asked. A predictive one might notice you asked about gym coverage but have not claimed it yet this year, and surface that proactively. That is not a retrieval problem. That is an agent problem. And it is exactly the kind of thing that makes the difference between a tool employees open once and a tool they actually rely on.

"The market for agentic RAG is projected to reach $40 billion by 2035. The underlying use case — grounded, auditable, multi-step reasoning over private documents — is the same one I am building for wellness benefits."

What's Next

Where this project goes from here

✓ Paragraph chunking with 2-sentence overlap in core/
✓ Intent classification in perceive.py
✓ Retrieval + generation with citations in reason/act
✓ Separated agent layer from retrieval layer
○ ReAct feedback loop: act observation → reason re-plan
○ Session memory in core/ — remember prior Q&A per user
○ Multi-document routing — which plan document to query first
○ Claims processing actions — not just answers, but submissions

Part 4 of the RAG → Agent series. Parts 1–3 cover building RAG from scratch without LangChain, chunking strategy, and local model benchmarking. The full code for the Wellness AI Agent — including the agent loop and core RAG machinery — is on GitHub.

Wellness AI Agent on GitHub · Built from scratch · No LangChain · Full agent loop coming.


LLM The Experiments That Taught Me More About RAG Than Any Course

September 04, 2025 · YES PRASAD

Benchmarking · Local LLM · Inference · RAG · Ollama · Qwen

I did not set out to run a benchmarking study. I was building a RAG system for policy documents, running it locally on a CPU-only machine with 8GB RAM, and things kept failing in ways that courses and tutorials never warned me about. Each failure taught me something. This article is those lessons — in the order I learned them, the hard way.

3 LLMs tested locally · 8GB RAM · CPU only · 11–12 winning tokens/min

The Setup

Same hardware. Same documents. Three very different results.

Three models running locally via Ollama: DeepSeek 4b with reasoning mode, Qwen 1.5b, and Qwen 7b. One ChromaDB vector store with paragraph-chunked policy documents. Top-K set to 5 — meaning every query retrieved 5 paragraph chunks and passed them all to the model as context.

I was not trying to find the best model for general use. I was trying to find the best model for this specific task: answering questions about policy documents, on constrained hardware, with real retrieved context — not a toy prompt.

Test Setup

── ENVIRONMENT
Hardware: MacBook · CPU only · 8GB RAM
Runtime: Ollama · local inference · no GPU
Vector DB: ChromaDB · persistent · paragraph chunks
Top-K: 5 chunks per query
Queries: 20 policy questions · same across all models

── MODELS TESTED
DeepSeek 4b with reasoning mode
Qwen 1.5b quantized
Qwen 7b quantized

Experiment 1

DEEPSEEK 4B · REASONING MODE

Impressive in demos. Unusable on CPU.

DeepSeek's reasoning mode is compelling — you can watch the chain of thought unroll before the answer arrives. On a GPU, that is a feature. On a CPU with 8GB RAM, it is a liability.

Average throughput: 3–5 tokens per minute. A 200-token answer took anywhere from 40 minutes to over an hour. No user will wait an hour for a policy answer. That is not a performance problem; it is a product-ending problem.

But speed was not the worst of it. Under top-K of 5, DeepSeek's coherence degraded. It would blend facts from different chunks — sometimes correctly, sometimes not — producing answers that sounded right but introduced numbers or conditions that were not in any of the retrieved documents. Subtle hallucination. The hardest kind to catch.

Lesson: Reasoning mode requires enough model capacity to hold the reasoning trace and the retrieved context simultaneously. On constrained hardware with large context, it becomes a liability, not an asset.

Experiment 2

QWEN 1.5B · FAST BUT BRITTLE

Faster throughput. Different failure mode.

Qwen 1.5b was faster — 7 to 9 tokens per minute — and for small contexts it performed reasonably. The problem emerged specifically with top-K of 5. Passing 5 paragraph chunks to a 1.5b parameter model is asking it to reason across more context than it can reliably hold.

The failure mode was different from DeepSeek. Qwen 1.5b did not blend facts — it truncated. It answered based on the first 1 or 2 chunks and ignored the rest, even when the most relevant information was in chunk 4 or 5. The retrieved context window exceeded the model's effective reasoning capacity, and it defaulted to what it saw first.

Lesson: Small quantized models have an effective context ceiling that is lower than their advertised context window. For RAG with large top-K, model capacity matters more than raw speed.

Experiment 3

QWEN 7B · THE SWEET SPOT

11–12 tokens/min. Coherence held.

Qwen 7b ran at 11 to 12 tokens per minute on the same CPU-only hardware. A 200-token answer in roughly 17 minutes — slow by API standards, but workable for a local system where privacy matters more than speed.
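The wait times follow directly from the throughput figures: minutes = answer tokens ÷ tokens per minute. The small sketch below redoes that arithmetic for a 200-token answer, using the measured ranges quoted in this article; the function and variable names are mine.

```typescript
// Minutes needed to generate `tokens` at a given throughput (tokens per minute).
function minutesToGenerate(tokens: number, tokensPerMinute: number): number {
  return tokens / tokensPerMinute;
}

const answerTokens = 200;

// [slowest, fastest] measured throughput per model, in tokens per minute.
const throughput: Record<string, [number, number]> = {
  "DeepSeek 4b": [3, 5],
  "Qwen 1.5b": [7, 9],
  "Qwen 7b": [11, 12],
};

for (const [model, [slow, fast]] of Object.entries(throughput)) {
  const best = minutesToGenerate(answerTokens, fast).toFixed(1);
  const worst = minutesToGenerate(answerTokens, slow).toFixed(1);
  console.log(`${model}: ${best} to ${worst} minutes`);
}
```

Running the same calculation with your own expected answer length is a quick way to decide whether a model is viable before investing in prompt or retrieval tuning.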

More importantly, coherence held under full top-K context. The model could reason across all 5 retrieved chunks, synthesize information from multiple paragraphs, and produce answers that stayed grounded in the retrieved content. Hallucination dropped dramatically compared to the smaller models.

Lesson: In my tests on constrained hardware, 7b was the smallest model size that kept top-K coherence reliable. The speed penalty is real, and so is the quality gain.

The Full Picture

Why top-K matters more than people think

Full Benchmark Results · Top-K=5 · CPU only · 8GB RAM

MODEL         TOK/MIN   COHERENCE      FAILURE MODE
DeepSeek 4b   3–5       Failed ✗       Fact blending
Qwen 1.5b     7–9       Degraded △     Context truncation
Qwen 7b       11–12     Consistent ✓   None observed

Top-K is the number of retrieved chunks passed to the LLM as context. Higher top-K means more retrieved information — which sounds better. But there is a model capacity ceiling. Below it, more context improves answers. Above it, the model starts losing coherence, dropping chunks, or blending facts incorrectly.

The right top-K is not a universal constant. It is a function of your model's capacity, your chunk size, and your document type. For my system — paragraph chunks averaging 150–200 words, policy documents, Qwen 7b — top-K of 5 was the practical ceiling before quality started degrading.
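For readers who have not looked inside a retriever, top-K selection is just "score every chunk against the query, sort, keep the first K". The sketch below shows that with cosine similarity on toy 2-dimensional vectors. In my system ChromaDB does this work; the `cosineSimilarity` and `topK` functions and the sample vectors here are illustrative assumptions, not its API.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

interface EmbeddedChunk {
  id: string;
  embedding: number[];
}

// Score every chunk against the query and keep the k best.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors; real embeddings have hundreds of dimensions.
const toyChunks: EmbeddedChunk[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [1, 1] },
];
console.log(topK([1, 0], toyChunks, 2).map((hit) => hit.id));
```

Everything that `topK` returns is handed to the model, so raising k raises the context load linearly even when the extra chunks score poorly.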

The Most Important Experiment

The one no benchmark tells you: what does a wrong answer look like?

Token speed and coherence counts are useful. But the most important experiment I ran was not quantitative. I manually read 20 answers, traced each one back to its citation, opened the source document at the cited lines, and asked: did the model stay inside the retrieved context, or did it bring in something from outside?

"The citation is not just a trust signal for the user. It is the only way to know, after the fact, whether your model stayed inside the retrieved context or started making things up."

Smaller models failed this test more often. Not always — sometimes they were perfectly accurate on simple queries. But on complex multi-clause policies with conditions and exceptions, they would occasionally introduce a number or qualifier that was not in the cited lines. Without citations, you would never know. You would just trust the confident-sounding answer.
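I did this check by hand, but one narrow part of it can be approximated automatically: flagging numbers that appear in an answer but not in the cited text. The sketch below is that approximation only, with hypothetical helper names. It is not what I ran for the experiment, and it cannot catch invented qualifiers or conditions.

```typescript
// Pull out numeric tokens such as 500, 1,000 or 12.5 (commas are dropped).
function extractNumbers(text: string): string[] {
  return (text.match(/\d+(?:[.,]\d+)*/g) ?? []).map((n) => n.replace(/,/g, ""));
}

// Numbers present in the answer that never occur in the cited source text.
function ungroundedNumbers(answer: string, citedText: string): string[] {
  const cited = new Set(extractNumbers(citedText));
  return extractNumbers(answer).filter((n) => !cited.has(n));
}

const citedLines = "Employees are entitled to reimbursement up to $500 provided the expense was pre-approved.";
const suspectAnswer = "You can claim up to $750 if the expense was pre-approved.";
console.log(ungroundedNumbers(suspectAnswer, citedLines));
```

An empty result does not prove the answer is grounded; a non-empty one is a cheap signal that the answer deserves a manual read against its citation.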

What I'd Do Differently

Three decisions I'd make from day one

Decision 1

Start with 7b minimum

Do not optimise for speed before you know your model can hold the context load. Start with a model that can handle your top-K reliably.

Decision 2

Citations from line one

Not as a feature — as infrastructure. Every chunk that touches the LLM should carry its source metadata from the very beginning.

Decision 3

Measure token speed early

On local hardware, the difference between 5 and 12 tokens/min is the difference between a usable tool and an unusable one.

Part 3 of 3. Part 1 covers building the full system from scratch without LangChain. Part 2 covers chunking strategy — why paragraph boundaries beat fixed-size windows and why overlap helps the LLM, not retrieval.

Code on GitHub · Fluent-Graph on Blog · Built by Eshwar Prasad Yaddanapudi


LLM What RAG Tutorials Don't Teach You — Chunking, Hallucination, and Token Speed

September 02, 2025 · YES PRASAD

Chunking Strategy · Hallucination · Local LLM · RAG · Overlap

Most RAG tutorials show you how to get an answer. Very few show you how to get a correct answer — and almost none explain what causes a wrong one. After building a multi-document RAG system from scratch, I ran into three problems no tutorial warned me about: chunking destroying meaning, quantized models hallucinating under load, and token speed making some models unusable on real hardware.

This is what I learned — and what I would do differently from day one.

Problem 1

Fixed-size chunking destroys meaning

The default chunking strategy in most tutorials is fixed-size — split every 512 tokens, maybe with a small overlap. It is simple, predictable, and fast. It is also wrong for documents where meaning lives inside a continuous clause.

I was chunking policy documents. A policy clause like "employees are entitled to reimbursement up to $500 provided the expense was pre-approved by a line manager and falls within the categories defined in Appendix B" — split at a fixed boundary — becomes two useless half-sentences. Neither chunk retrieves well. Neither answers the question correctly.

"Chunking is not a technical decision. It is a semantic decision. The unit of your chunk should be the unit of meaning in your document."

Chunking Comparison — Policy Document

── FIXED-SIZE (512 tokens)
Chunk A: "...employees are entitled to reimbursement up to $500 provided the"
✗ missing: who qualifies, what categories apply
Chunk B: "expense was pre-approved by a line manager and falls within..."
✗ missing: the amount, the subject of the clause

── PARAGRAPH-BASED
Chunk A: "employees are entitled to reimbursement up to $500 provided the expense was pre-approved by a line manager and falls within Appendix B categories."
✓ complete semantic unit: retrieves and answers correctly

My solution: chunk by paragraph boundary. A blank line in the document signals the end of a semantic unit. When the chunker hits two consecutive newlines, it closes the current buffer and stores the chunk. The result is chunks that map to human-readable ideas — not arbitrary token counts.
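Here is a minimal sketch of that buffer-and-flush idea. My actual chunker is a Python file; this version is written in TypeScript to match the other code on this blog, and the `Chunk` shape and `chunkByParagraph` name are assumptions. It also records start and end line numbers, which is what makes line-level citations possible later.

```typescript
interface Chunk {
  text: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
}

// Close the current buffer whenever a blank line (two consecutive newlines)
// is reached, so each chunk maps to one paragraph.
function chunkByParagraph(document: string): Chunk[] {
  const chunks: Chunk[] = [];
  const lines = document.split("\n");
  let buffer: string[] = [];
  let startLine = 0;

  for (let index = 0; index < lines.length; index++) {
    const lineNumber = index + 1;
    const line = lines[index].trim();
    if (line === "") {
      if (buffer.length > 0) {
        chunks.push({ text: buffer.join(" "), startLine, endLine: lineNumber - 1 });
        buffer = [];
      }
      continue;
    }
    if (buffer.length === 0) startLine = lineNumber; // paragraph begins here
    buffer.push(line);
  }
  // Flush the final paragraph when the document does not end with a blank line.
  if (buffer.length > 0) {
    chunks.push({ text: buffer.join(" "), startLine, endLine: lines.length });
  }
  return chunks;
}

const sampleDocument = "First clause, line one.\nFirst clause, line two.\n\nSecond clause.";
console.log(chunkByParagraph(sampleDocument));
```

Very long paragraphs would still need a size cap, and documents without blank lines would collapse into a single chunk; both cases are left out of this sketch.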

Problem 2

Overlap is for the LLM — not for retrieval

Most tutorials that mention overlap describe it as improving retrieval — the idea being that repeated content increases the chance a chunk matches a query embedding. That is not quite right. Overlap does not meaningfully change cosine similarity scores. What it actually does is help the LLM.

When your retriever returns chunk 7, that chunk starts cold. The LLM has no idea what was in chunk 6. If a clause spans a paragraph boundary — condition in paragraph 6, consequence in paragraph 7 — the LLM reading only chunk 7 will generate an incomplete or wrong answer.

Without overlap

Chunk 7 starts cold

LLM misses the condition set in paragraph 6. Answer is incomplete or fabricated from training data.

With 2-sentence overlap

Chunk 7 has context

Last 2 sentences of paragraph 6 prefix chunk 7. LLM has continuity. Answer is grounded and complete.

I added the last 2 sentences of each paragraph as a prefix to the next chunk. This is not retrieval improvement — it is generation improvement. Keep that distinction clear, because it changes how you would measure whether it is working.
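A sketch of the prefixing step, under stated assumptions: the sentence splitter is deliberately naive (it breaks on ".", "!" and "?", so abbreviations and decimals would confuse it), and the function names are hypothetical rather than taken from my chunker.

```typescript
// Naive sentence splitter: a run of text ending in ., ! or ? is one sentence.
function splitSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return matches.map((sentence) => sentence.trim()).filter((sentence) => sentence.length > 0);
}

// Prefix every paragraph (except the first) with the last N sentences of the
// paragraph before it, so the LLM sees the lead-in to each chunk.
function addOverlap(paragraphs: string[], overlapSentences = 2): string[] {
  return paragraphs.map((paragraph, index) => {
    if (index === 0) return paragraph;
    const tail = splitSentences(paragraphs[index - 1]).slice(-overlapSentences);
    return [...tail, paragraph].join(" ");
  });
}

const overlapped = addOverlap(["One. Two. Three.", "Four."]);
console.log(overlapped);
```

Because the goal is better generation rather than better retrieval, one option is to embed the original paragraph and attach the overlapped text only when the chunk is passed to the LLM; this sketch leaves that choice open.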

"The overlap is not for the embedding model. It is for the LLM that reads what the embedding model found."

Problem 3

Quantized models hallucinate under large top-K

I tested three local models: DeepSeek 4b, Qwen 1.5b, and Qwen 7b. On small top-K contexts, all three performed reasonably. The problems emerged when I increased top-K to 5, passing more retrieved context to the model.

Smaller quantized models started hallucinating — not dramatically, but in the subtle way that is hardest to catch: confident answers that almost matched the policy document but introduced a number or condition that was not there. The model could not hold the full context and started filling gaps from its training data instead of the retrieved chunks.
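One cheap way to surface this specific failure, an answer that introduces a number absent from the source, is to compare the numbers in the answer against the numbers in the retrieved chunks. The sketch below is an assumption about how such a check could look, not part of the system described here, and it only catches numeric drift, not invented conditions.

```python
from __future__ import annotations

import re

# Matches integers and simple decimals or thousands groups such as 500, 1,000, or 2.5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(answer: str, retrieved_chunks: list[str]) -> set[str]:
    """Return numbers that appear in the answer but in none of the retrieved chunks."""
    context_numbers: set[str] = set()
    for chunk in retrieved_chunks:
        context_numbers.update(NUMBER.findall(chunk))
    return set(NUMBER.findall(answer)) - context_numbers
```

If the retrieved chunk says "up to $500 within 30 days" and the model answers "capped at $750 within 30 days", the function returns `{"750"}`, which is a signal to read the cited chunk by hand.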

Model Benchmark · Top-K=5 · CPU only · 8GB RAM

MODEL          TOKENS/MIN    COHERENCE
DeepSeek 4b    3–5           Hallucinated ✗
Qwen 1.5b      7–9           Degraded △
Qwen 7b        11–12         Consistent ✓

Measured on MacBook · CPU-only inference · averages over 20 queries

DeepSeek with its reasoning mode was particularly slow — 3 to 5 tokens per minute on CPU-only hardware. For a single query that might mean waiting 40 to 60 minutes for an answer. That is not a performance problem — it is a product-ending problem. Qwen 7b at 11 to 12 tokens per minute was the only model that balanced speed and coherence at top-K of 5.
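The wait time follows directly from the throughput numbers. The sketch below works through the arithmetic, assuming an answer of about 200 generated tokens. That length is an illustrative assumption, not a measured value, and prompt processing time is ignored.

```python
def minutes_to_answer(answer_tokens: int, tokens_per_min: float) -> float:
    """Generation time in minutes, ignoring prompt processing time."""
    return answer_tokens / tokens_per_min


ANSWER_TOKENS = 200  # assumed answer length, not a measured value

# Throughput ranges (slowest, fastest) in tokens per minute, from the benchmark table.
throughput = {
    "DeepSeek 4b": (3, 5),
    "Qwen 1.5b": (7, 9),
    "Qwen 7b": (11, 12),
}

for model, (slow, fast) in throughput.items():
    best = minutes_to_answer(ANSWER_TOKENS, fast)
    worst = minutes_to_answer(ANSWER_TOKENS, slow)
    print(f"{model}: {best:.0f} to {worst:.0f} minutes")
```

Under that assumption DeepSeek lands at roughly 40 to 67 minutes per answer, in line with the wait described above, while Qwen 7b lands at roughly 17 to 18 minutes.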

The Real Insight

Citations as your debugging loop

Every answer in my system includes a citation: the filename, the start line, and the end line of every chunk that contributed to the response. I added this for user trust. What I quickly realized is that citations are also the most powerful debugging tool in the entire pipeline.

If the citation points to lines 42–67 and the answer is wrong, I know exactly where to look. Open the document, read those lines, ask: was the answer actually in that chunk? Was it too large? Was the answer split across a boundary?

Answer Output — with Citations

Answer: Employees are entitled to reimbursement up to $500...

📚 Sources:

1. policy_handbook.txt (lines 42–67)

2. expense_guidelines.txt (lines 12–28)

# if answer is wrong → open file → read lines 42-67 → diagnose chunk boundary
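A sketch of how a citation like this can be produced and then used for debugging, assuming each retrieved chunk carries its filename and line range as metadata. The `RetrievedChunk` type and both function names are hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    filename: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def format_answer(answer: str, sources: list[RetrievedChunk]) -> str:
    """Append a numbered source list (file plus line range) to the answer."""
    lines = [f"Answer: {answer}", "", "Sources:"]
    for i, src in enumerate(sources, start=1):
        lines.append(f"{i}. {src.filename} (lines {src.start_line}–{src.end_line})")
    return "\n".join(lines)


def read_cited_lines(path: str, start_line: int, end_line: int) -> str:
    """Return the cited line range from a file so it can be checked by hand."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return "\n".join(lines[start_line - 1:end_line])
```

When an answer looks wrong, calling `read_cited_lines` with the cited filename and line range prints exactly the text the model was given, which is the first thing to inspect.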

"Run 10 policy questions. Record the citations. Read the cited chunks manually. That process is the foundation of a RAG evaluation loop — and it costs nothing except discipline."

Part 2 of 3. Part 1 covers building the full system from scratch without LangChain and why that choice matters. Part 3 covers the complete model benchmarking experiment — three models, one CPU, and what the token speed numbers mean for production inference decisions.

Full code on GitHub — with comments left in on purpose, so you can see the thinking.

View on GitHub Connect on LinkedIn

LLM I Built a RAG System Without LangChain — Here's What I Actually Learned

September 01, 2025 · YES PRASAD

RAG · LLM · ChromaDB · Ollama · Python · No Frameworks

Every RAG tutorial I found started the same way: pip install langchain. And then the magic happened — a few method calls, a vector store, and suddenly you had a "production RAG system." Except you didn't. You had a wrapper around a wrapper around a wrapper, and when something went wrong, you had no idea where to look.

So I built one from scratch. No LangChain. No LlamaIndex. Just Python, Sentence-Transformers, ChromaDB, and a local LLM running on my laptop. Here is what building it from the ground up taught me that no tutorial covered.

The Problem

Why most RAG demos don't prepare you

Scenario A

You follow the tutorial.
It works perfectly.

Three API calls, a vector store, a retriever. The demo answers correctly. You ship it. Then it fails on real documents in ways you can't debug because you don't know what's inside the abstraction.

Scenario B

You build from scratch.
You break everything.

Your chunker splits a policy clause mid-sentence. Your model hallucinates on large contexts. You fix both — and now you understand every layer because you wrote every layer.

"I chose not to use LangChain deliberately. This helped me internalize the mechanics of RAG — from document parsing to contextual retrieval — all the way through model interaction."

The Architecture

Two pipelines. Every stage visible.

RAG Pipeline — KnowledgeAI

── INGESTION (runs once per document set)

Plain Text Docs → Paragraph Chunker → 2-Sentence Overlap → MiniLM Embeddings → ChromaDB Store

── RETRIEVAL + GENERATION (every query)

User Question → Query Embed → Top-K Cosine → Context Assembly → Qwen 7b · Ollama

── OUTPUT

✓ Answer + filename · start_line · end_line

There are two pipelines that run at different times. Ingestion happens once. Retrieval and generation happen on every query. Understanding that separation is the first thing frameworks hide from you. When you write it yourself, the boundary is a function call — and you know exactly what crosses it.
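To make that boundary concrete, here is a standard-library sketch of the two pipelines as separate functions. Everything in it is a stand-in: `embed` is a toy bag-of-words vector in place of a real embedding model, the in-memory list replaces a persistent vector store, and the prompt is only built, not sent to a model. The function names are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. Stand-in for a real sentence embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion pipeline: runs once per document set."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store: list[tuple[str, Counter]], question: str, top_k: int = 3) -> list[str]:
    """Query pipeline, step 1: embed the question and rank chunks by cosine similarity."""
    query_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Query pipeline, step 2: assemble the retrieved chunks into one prompt."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n\n{joined}\n\nQuestion: {question}"
```

The only thing that crosses from ingestion to the query side is the return value of `ingest`. Swapping the toy pieces for a real embedding model and a vector store changes the function bodies, not that boundary.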

What I Built

The system — and what I left visible on purpose

The system reads plain text policy documents, chunks them into paragraphs with 2-sentence overlap, generates embeddings using all-MiniLM-L6-v2, stores them in a persistent ChromaDB instance, and answers natural language questions by retrieving the most relevant chunks and passing them to a local LLM via Ollama.

I left commented logic, print statements, and intermediate variables in the code deliberately. Comments like "this is valid only for the very first para we read from the file" are in there on purpose: they show the edge cases I found while building, instead of being polished away to make the code look tidy.

Chunking

Paragraph-based

Blank line = end of semantic unit. Preserves meaning that fixed-size windows destroy.

Overlap

2-sentence prefix

Last 2 sentences of each paragraph prefix the next chunk. Helps the LLM, not retrieval scores.

Citations

File + line range

Every answer includes filename, start line, end line. Auditable and debuggable.

The Numbers

What benchmarking three local models revealed

3

LLMs benchmarked locally

8GB

RAM · CPU only

11–12

winning tokens/min

Model Benchmark · Top-K=5 · CPU only · 8GB RAM

Model          Tokens/min    Coherence at top-K=5
DeepSeek 4b    3–5           Hallucinated ✗
Qwen 1.5b      7–9           Degraded △
Qwen 7b        11–12         Consistent ✓

The failure modes were different for each model. DeepSeek blended facts across chunks — subtle hallucination, confident tone. Qwen 1.5b truncated — it answered from the first chunk and ignored the rest. Qwen 7b held coherence across all 5 retrieved chunks. Same hardware, same documents, very different results.

The Insight

Citations are your debugging tool — not just a trust signal

I added citations for user trust. What I discovered quickly is that they are also the most powerful debugging instrument in the pipeline. When an answer is wrong, I know exactly where to look — open the document, read the cited lines, ask: was the answer actually in that chunk? Was it too large? Was the answer split across a boundary?

"Citations aren't just for the user's trust. They are your chunking strategy's unit test."

What's Honest

What works — and what's next

The system works. But it isn't fully production-ready, and I'm honest about that in the README. Three things are missing: an evaluation harness that scores faithfulness programmatically, ingestion idempotency so re-running doesn't create duplicate chunk IDs, and a distance threshold filter so low-confidence retrievals don't pollute the prompt.

✓ Paragraph chunking with 2-sentence overlap
✓ Citation-grounded retrieval with line tracking
✓ Local LLM benchmarking across 3 models
○ LLM-as-judge evaluation harness — next
○ Ingestion idempotency via content hashing — next
○ Distance threshold filtering — next
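Two of the missing pieces are small enough to sketch. The code below is one possible shape for them, not what the repository does: chunk IDs are derived from a content hash so that re-ingesting the same text produces the same ID, and retrieved results are dropped when their distance exceeds a cutoff. The dict stands in for a real vector store, and the `0.6` cutoff is a hypothetical value that would need tuning against real queries.

```python
from __future__ import annotations

import hashlib


def chunk_id(filename: str, text: str) -> str:
    """Deterministic ID: the same file and text always produce the same ID."""
    digest = hashlib.sha256(f"{filename}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]


def ingest_once(store: dict[str, str], filename: str, chunks: list[str]) -> int:
    """Add only chunks whose ID is not already stored. Returns how many were added."""
    added = 0
    for text in chunks:
        cid = chunk_id(filename, text)
        if cid not in store:
            store[cid] = text
            added += 1
    return added


MAX_DISTANCE = 0.6  # hypothetical cutoff; would need tuning against real queries


def filter_by_distance(
    results: list[tuple[str, float]], max_distance: float = MAX_DISTANCE
) -> list[str]:
    """Keep only retrieved chunks whose distance to the query is within the cutoff."""
    return [text for text, distance in results if distance <= max_distance]
```

Running `ingest_once` twice on the same file adds nothing the second time, which is the idempotency property the list above describes.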

Part 1 of 3. Part 2 goes deep on chunking strategy — why paragraph boundaries beat fixed-size windows for policy documents, and why overlap helps the LLM not retrieval. Part 3 covers the full model benchmarking experiment and what token speed means for real inference decisions.

Full code and README on GitHub. Built to show how RAG works — not hide it.

View on GitHub Connect on LinkedIn
