From ChromaDB to S3 Vectors — What I Learned About How Embeddings Actually Work

ChromaDB  ·  S3 Vectors  ·  RAG  ·  AWS Bedrock

I have been building RAG systems for a while now: first ChromaDB locally, then Bedrock and S3 Vectors on AWS. As I was building, I kept asking myself: "so where exactly does the embedding sit?"

So I went deep. I traced one sentence — one actual benefits document chunk — from raw text all the way through embedding, storage, and retrieval. In two completely different systems. What came out the other side changed how I think about vector databases in general.


A simple question

My RAG system answers health insurance benefits questions. The documents contain sentences like this:

"Preventive care: Annual physical, immunizations, and screenings
covered at 100% in-network, no copay."

When a user asks "Is my annual checkup free?" — they never wrote those exact words. The RAG system has to find that sentence anyway. That is the whole point of vector search: finding semantic similarity, not keyword matches.

To get there, that sentence has to become a number. Actually, it has to become a list of numbers. Let me show you exactly what happens — in both worlds.


Same pipeline, very different internals

I built the same pipeline twice. First with sentence-transformers + ChromaDB running locally. Then with Amazon Titan V2 + S3 Vectors on AWS via CDK. Same input. Same goal.

World 1 — Local / ChromaDB
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.client = chromadb.PersistentClient(path='./chroma_store')
self.collection = self.client.get_or_create_collection('policies')
World 2 — AWS / S3 Vectors
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
s3vectors = boto3.client("s3vectors", region_name="us-east-1")

One runs inference on my laptop. The other makes API calls to AWS. But the shape of what they produce — and what they store — is structurally the same.


Turning text into numbers

"Preventive care: Annual physical, immunizations..."
         │  encode()  or  invoke_model()
         ▼
[0.0231, -0.1847, 0.0093, 0.2341, ..., -0.0876]

With all-MiniLM-L6-v2: The model weights (~80 MB) live on your machine. model.encode(text) runs inference locally and returns a numpy array of 384 floats. No network call. No cost. Fast.

With Titan V2: A POST request goes to Bedrock. AWS runs the model on their infrastructure and returns a list of 1024 floats. Network latency. Billed per token.
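For reference, the Bedrock call looks roughly like this. It is a sketch, not the post's production code: the model ID and response shape follow the Titan Embed Text V2 API, but `embed_with_titan` and `parse_titan_response` are helper names I made up.

```python
import json

def parse_titan_response(body_bytes: bytes) -> list:
    # Titan V2 returns a JSON body with the vector under "embedding"
    return json.loads(body_bytes)["embedding"]

def embed_with_titan(bedrock_runtime, text: str) -> list:
    # One POST per chunk, billed per input token
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return parse_titan_response(response["body"].read())
```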

"The embedding is not a summary. It is a coordinate. A point in high-dimensional space where meaning lives as geometry."

What gets stored and where

This is where the two systems diverge most visibly.

ChromaDB — add
def process_chunk(self, chunk_text, chunk_id):
    embeddings = self.model.encode(chunk_text)
    self.collection.add(
        documents=[chunk_text],
        embeddings=[embeddings.tolist()],
        ids=[chunk_id]
    )
AWS — put_vectors()
vectors_payload.append({
    "key": f"{policy}_chunk_{idx:04d}",
    "data": {"float32": embedding},         # the 1024 floats
    "metadata": {
        "policy":      "PPO_PLUS",
        "chunk":       chunk_text,          # original text stored here
        "chunk_index": idx,
        "source":      "docs/ppo_plus_2024.txt"
    }
})
s3vectors.put_vectors(
    vectorBucketName=BUCKET, indexName=INDEX, vectors=vectors_payload
)

The structure is the same: a unique key, the float vector, and the original text traveling alongside. The difference is that my ChromaDB version stores no policy metadata — which matters enormously at query time.


The full pipeline visualised

YOUR SENTENCE (raw text)
         │
         ▼
┌─────────────────────────┐
│   EMBEDDING MODEL       │
│                         │
│  MiniLM  →  384 floats  │
│  Titan   →  1024 floats │
└─────────┬───────────────┘
          │
          ▼
┌─────────────────────────────────────────────────┐
│              VECTOR STORE                        │
│                                                  │
│  KEY          │ VECTOR          │ TEXT/METADATA  │
│  chunk_0042   │ [0.02, -0.18…] │ "Preventive…"  │
│  chunk_0043   │ [0.11,  0.03…] │ "Deductible…"  │
│  chunk_0044   │ [-0.07, 0.22…] │ "Out-of-pocket"│
│                                                  │
│  ◄── similarity search runs here ──►             │
│        (on the float columns only)               │
└─────────────────────────────────────────────────┘
          │
          ▼
    QUERY: "Is my annual checkup free?"
         │  embed this too
         ▼
    [0.019, -0.176, ...]   ← 384 or 1024 floats
         │  cosine similarity against stored vectors
         ▼
    chunk_0042  score 0.91  ✓  return its text
    chunk_0019  score 0.74
    chunk_0031  score 0.71
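The scoring step at the bottom of the diagram is plain cosine similarity. Here is a minimal, self-contained sketch with toy 3-dimensional vectors (real ones have 384 or 1024 components; the store contents are invented):

```python
import math

def cosine(a, b):
    # dot product divided by the product of vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

store = {
    "chunk_0042": [0.9, 0.1, 0.0],
    "chunk_0019": [0.5, 0.5, 0.1],
    "chunk_0031": [0.0, 0.9, 0.3],
}
query = [0.85, 0.15, 0.05]  # the embedded question

# rank stored chunks by similarity to the query, best first
ranked = sorted(store, key=lambda k: cosine(store[k], query), reverse=True)
# ranked[0] -> "chunk_0042"
```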

ChromaDB uses two storage systems

I used to think ChromaDB was magic. It is not. It uses two completely separate storage systems — and understanding why is the most interesting technical detail in this entire article.

./chroma_store — on disk
./chroma_store/
├── chroma.sqlite3          ← ids, documents, metadata
└── <collection-uuid>/
    ├── data_level0.bin     ← the actual float vectors (HNSW)
    ├── header.bin
    └── length.bin
SQLite — chroma.sqlite3

Human-readable data

IDs, original document text, and metadata — everything you can read, filter, and query with standard SQL.

HNSW — data_level0.bin

The float vectors

Raw float arrays stored in a Hierarchical Navigable Small World graph, purpose-built for nearest-neighbour search.

Root Cause

Why two systems?

SQLite is not designed for "find me the 3 rows whose float arrays are geometrically closest to this other float array." HNSW builds a multi-layered graph and navigates it hierarchically — O(log n) instead of brute-force O(n). The two systems work together: HNSW returns IDs 42, 17, 8 → SQLite returns their text.
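That division of labour can be sketched in a few lines — a toy stand-in where one dict of numbers plays the HNSW index (here 1-D "vectors" for brevity) and a second dict plays SQLite; all keys and values are invented:

```python
# "HNSW" side: id -> vector; answers "which ids are geometrically closest?"
vectors = {42: 0.91, 17: 0.74, 8: 0.71, 3: 0.10}

# "SQLite" side: id -> human-readable text; answers "what do those ids say?"
texts = {42: "Preventive care...", 17: "Deductible...",
         8: "Out-of-pocket...", 3: "Dental..."}

def retrieve(query: float, k: int = 3) -> list:
    # step 1: vector math returns ids only
    nearest = sorted(vectors, key=lambda i: abs(vectors[i] - query))[:k]
    # step 2: id lookup returns the text the user actually reads
    return [texts[i] for i in nearest]
```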

HNSW GRAPH (simplified)

Layer 2 (coarse):    A ──────────────── E
                          \          /
Layer 1 (medium):    A ─── C ─────── E ─── G
                     │     │         │
Layer 0 (fine):    A─B─C─D─E─F─G─H─I─J  (all vectors here)

 Query enters at Layer 2, navigates toward target region,
  then drills to Layer 0 for exact nearest neighbors.
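The navigation the diagram describes can be mimicked with a toy greedy search. This is purely illustrative — 1-D "vectors" and a hand-built two-layer graph, not real HNSW construction:

```python
# Toy nodes: letter -> position on a line (stand-in for a high-dim vector)
nodes = {k: float(v) for v, k in enumerate("ABCDEFGHIJ")}
keys = list(nodes)

# Layer 0 links every node to its immediate neighbours; upper layers are sparser
layer0 = {k: [keys[j] for j in (i - 1, i + 1) if 0 <= j < len(keys)]
          for i, k in enumerate(keys)}
layer1 = {"A": ["C"], "C": ["A", "E"], "E": ["C", "G"], "G": ["E"]}
layer2 = {"A": ["E"], "E": ["A"]}
layers = [layer2, layer1, layer0]          # coarse -> fine

def greedy_search(query: float, entry: str = "A") -> str:
    current = entry
    for layer in layers:                   # enter at the top, drill down
        moved = True
        while moved:                       # hop to any closer neighbour
            moved = False
            for nb in layer.get(current, []):
                if abs(nodes[nb] - query) < abs(nodes[current] - query):
                    current, moved = nb, True
    return current                         # nearest node at Layer 0
```

Each layer only narrows the search region; the exact nearest neighbour is settled on the dense bottom layer.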
"ChromaDB looks like one thing from the outside. Inside, it's a SQLite database and an HNSW graph file working as a team — one for the math, one for the meaning."

Why metadata filtering actually matters

At query time, the user asks: "What is my deductible?" That question gets embedded into the same vector space as all stored chunks. Here is where the two implementations diverge in a way that directly affects answer quality.

Scenario A — ChromaDB, no metadata

Imprecise retrieval.

A question about PPO deductibles might return HMO chunks. The similarity search does not know which plan the user is asking about.

Mixed results — PPO question
returned HMO_SELECT chunks
Scenario B — S3 Vectors, filtered

Precise retrieval.

Policy is stored as metadata at ingest and used to filter before similarity search runs. Only the correct plan's chunks are ranked.

The right way — ChromaDB supports this too
results = self.collection.query(
    query_embeddings=[question_embedding.tolist()],
    n_results=3,
    where={"policy": "PPO_PLUS"}   # filter equivalent
)

The lesson: metadata is not decoration. It is the mechanism that makes multi-document RAG precise. Build it in from day one.
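The filter-then-rank behaviour is easy to model in memory. A toy sketch (policies, scores, and texts invented) showing why the filter must run before ranking:

```python
chunks = [
    {"policy": "PPO_PLUS",   "vec": 0.91, "text": "PPO deductible: $500"},
    {"policy": "HMO_SELECT", "vec": 0.93, "text": "HMO deductible: $1,000"},
    {"policy": "PPO_PLUS",   "vec": 0.60, "text": "PPO copay: $20"},
]

def query(q: float, policy: str, k: int = 2) -> list:
    # filter first: only the right plan's chunks enter the ranking
    candidates = [c for c in chunks if c["policy"] == policy]
    # then rank by (toy) similarity to the query vector
    candidates.sort(key=lambda c: abs(c["vec"] - q))
    return [c["text"] for c in candidates[:k]]
```

Without the filter, the HMO chunk (score 0.93, the closest match overall) would win and the user would be told the wrong deductible.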


ChromaDB vs S3 Vectors — side by side

Dimension                   ChromaDB (local)            S3 Vectors (AWS)
Embedding model             all-MiniLM-L6-v2, local     Titan Embed V2 via Bedrock
Vector dimensions           384                         1024
Where model runs            Your machine (CPU)          AWS infrastructure
Embedding cost              Free (compute only)         Billed per token
Vector storage              HNSW .bin files on disk     Managed ANN index
Text/metadata               SQLite (chroma.sqlite3)     Metadata fields in S3 Vectors
Query filtering             where={"policy": ...}       returnMetadata=True + filter
Scales beyond one machine   No (file-based)             Yes (managed service)
Setup complexity            pip install chromadb        CDK stack + IAM + bootstrap
Best for                    Local dev, prototyping      Production, multi-user, scale

What I would tell myself before starting

  1. ChromaDB is not magic. It is SQLite plus HNSW. SQLite handles text and metadata retrieval; HNSW handles the actual vector math. They are different tools solving different problems, bundled into one library.
  2. The embedding dimension is a contract. If you ingest with all-MiniLM-L6-v2 (384 dims) and query with Titan V2 (1024 dims), you get an error or nonsense results. Whatever model you use at ingest time, you must use at query time. Always.
  3. Metadata is not optional. The moment you have more than one document type or user context in your vector store, metadata filtering is how you keep retrievals precise. Build it in from day one.
  4. Local to cloud is architectural, not technical. The concepts — embed, store, search, retrieve — are identical. The execution model changes: local CPU vs API call, disk files vs managed service, free vs billed. The hardest part was IAM and CDK, not the vector logic.
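Point 2 above is cheap to enforce in code. A guard sketch (the function name is mine) that fails fast instead of returning nonsense results:

```python
def check_dims(query_vec: list, index_dims: int) -> list:
    # The ingest model and the query model must agree on dimensionality
    if len(query_vec) != index_dims:
        raise ValueError(
            f"query has {len(query_vec)} dims, index expects {index_dims}; "
            "did you embed with a different model?"
        )
    return query_vec
```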

Not by reading, but by geometry

The RAG pipeline I built for wellness benefits uses S3 Vectors in production now — with Titan V2 for embeddings, Nova Micro for generation, and Bedrock Guardrails on both input and output. The vector search itself is one step in a larger chain.

But understanding that one step precisely — what a vector is, where it lives, what retrieves it, and why ChromaDB needs both SQLite and HNSW to do the job — changed how I reason about everything above it.

That sentence about preventive care? It is a point in 1024-dimensional space now. When someone asks if their checkup is free, the system finds that point — not by reading, but by geometry.

RAG
Building RAG systems and agentic pipelines on AWS. Part of the RAG → Agentic RAG series — previous posts cover chunking strategy, hallucination handling, and the ReAct agent loop.