Vector Databases: The Unsung Heroes of RAG

Every RAG pipeline has the same bottleneck: how quickly and accurately can you find the right documents for a given query? Traditional full-text search often falls short on its own because it matches keywords, not meaning. Vector databases address this by storing text as high-dimensional numerical representations and finding semantically similar content, typically in milliseconds. Here is how they work under the hood.

How Text Becomes Vectors

Before text can be stored in a vector database, it must be converted into a numerical representation called an embedding. Embedding models like OpenAI's text-embedding-3-large or open-source alternatives like BGE and E5 take a chunk of text, usually a paragraph or a few sentences, and produce a fixed-length array of floating-point numbers. The length depends on the model: typically a few hundred to a few thousand dimensions, with 768, 1024, 1536, and 3072 all common.

These numbers encode the semantic meaning of the text. Two passages about "container orchestration" and "managing Docker deployments" will produce embeddings that are close together in vector space, even though they share few common words. A passage about "Italian cooking" will produce an embedding far away from both. This property — that semantic similarity maps to geometric proximity — is what makes vector search so powerful.
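To make "geometric proximity" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three 4-dimensional vectors are hand-made stand-ins chosen for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative values only, not the output of any real embedding model.
orchestration = [0.9, 0.8, 0.1, 0.0]    # "container orchestration"
docker_deploys = [0.8, 0.9, 0.2, 0.1]   # "managing Docker deployments"
italian_cooking = [0.0, 0.1, 0.9, 0.8]  # "Italian cooking"

print(cosine_similarity(orchestration, docker_deploys))   # close to 1
print(cosine_similarity(orchestration, italian_cooking))  # much lower
```

The two infrastructure passages score near 1, while the cooking passage scores close to 0 against either of them.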

The quality of your RAG pipeline depends directly on the quality of your embeddings. Choosing the right model matters: models trained or fine-tuned on technical content often outperform general-purpose models on technical documentation, so it is worth evaluating candidates on a sample of your own queries. Chunk size also matters: too large and the embedding becomes diluted across multiple topics; too small and it loses context. A range of 200-500 tokens per chunk is a common starting point, but the best value depends on your documents and should be validated against real queries.
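As a rough illustration of fixed-size chunking, the sketch below splits a token list into windows. Two things here are assumptions rather than anything prescribed above: whitespace splitting stands in for the embedding model's real tokenizer, and the windows overlap by 50 tokens, a common way to avoid cutting context at chunk boundaries.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Whitespace splitting is a stand-in for a real tokenizer (assumption).
text = " ".join(f"word{i}" for i in range(700))
chunks = chunk_tokens(text.split())
print([len(c) for c in chunks])  # [300, 300, 200]
```

Production pipelines often split on sentence or heading boundaries instead of fixed offsets; the fixed window is just the simplest baseline to measure against.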

Indexing Algorithms: HNSW vs IVF

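The naive approach to vector search, comparing the query vector against every stored vector, is called brute-force search. It gives perfect accuracy but scales linearly with database size. With a million vectors at 1536 dimensions, each query computes a million distances, roughly 1.5 billion multiply-add operations over about 6 GB of float32 data. Optimized code on a modern CPU core needs on the order of hundreds of milliseconds for that, which is tolerable for an occasional query but becomes a bottleneck as the collection or the query volume grows.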
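A brute-force scan can be sketched in a few lines. This version uses Euclidean distance and small random vectors purely for illustration; a real system would use vectorized or SIMD code rather than pure Python loops.

```python
import heapq
import math
import random

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, k=3):
    """Return the k closest (distance, index) pairs, scanning every stored vector."""
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = vectors[42]  # query with a stored vector, so the best hit is known

results = brute_force_search(query, vectors, k=3)
print(results[0])  # (0.0, 42): the vector itself is its own nearest neighbor
```

Every query touches all 1,000 vectors here, which is exactly the linear cost that approximate indexes are designed to avoid.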

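Approximate Nearest Neighbor (ANN) algorithms trade a small amount of accuracy for massive speed improvements. HNSW (Hierarchical Navigable Small World) is the most popular. It builds a multi-layer graph in which each node links to nearby vectors; the upper layers contain only a sparse subset of nodes, and the bottom layer contains all of them. Search starts at the top layer, where long-range links cover distance quickly, and drills down to the denser layers to refine the result. With well-tuned parameters, HNSW commonly reaches recall above 95% at latencies of a few milliseconds on collections of millions of vectors. The cost is memory: the graph lives in RAM alongside the vectors, which becomes the limiting factor at billion scale.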

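IVF (Inverted File Index) takes a different approach. It clusters vectors into partitions using k-means, then searches only the partitions nearest to the query. IVF uses less memory than HNSW and is faster to build, which makes it attractive when memory is tight or indexes are rebuilt often. The tradeoff is lower recall at a given speed, especially when the query's nprobe parameter (the number of partitions searched) is small relative to the total number of partitions: true neighbors that fall in unprobed partitions are never examined. Because the centroids are fixed when the index is built, an IVF index may also need rebuilding if the data distribution shifts over time.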
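The IVF idea can be sketched with a toy implementation: a few rounds of k-means to form partitions, then a search that scans only the `nprobe` closest ones. The function names, the iteration count, and the random data are all illustrative assumptions; this is not how any particular library implements it.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def nearest_centroids(point, centroids, n=1):
    """Indices of the n centroids closest to point, nearest first."""
    order = sorted(range(len(centroids)), key=lambda c: dist(point, centroids[c]))
    return order[:n]

def assign(vectors, centroids):
    """Put each vector index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroids(v, centroids)[0]].append(i)
    return lists

def build_ivf(vectors, n_lists, iterations=10, seed=0):
    """Cluster with a few k-means rounds; return (centroids, inverted lists)."""
    centroids = random.Random(seed).sample(vectors, n_lists)
    for _ in range(iterations):
        lists = assign(vectors, centroids)
        # Keep the old centroid if a partition ends up empty.
        centroids = [mean([vectors[i] for i in ids]) if ids else centroids[c]
                     for c, ids in enumerate(lists)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=1):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    probed = nearest_centroids(query, centroids, nprobe)
    candidates = [i for c in probed for i in lists[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]

rng = random.Random(1)
vectors = [[rng.random() for _ in range(4)] for _ in range(300)]
centroids, lists = build_ivf(vectors, n_lists=8)
query = [0.5, 0.5, 0.5, 0.5]
exact = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:5]

for nprobe in (1, 3, 8):
    found = ivf_search(query, vectors, centroids, lists, k=5, nprobe=nprobe)
    recall = len(set(found) & set(exact)) / 5
    print(f"nprobe={nprobe}: recall={recall}")
```

Probing all 8 partitions reproduces the exact result, at which point IVF has degenerated into a brute-force scan. Smaller `nprobe` values scan fewer vectors and may miss neighbors that sit just across a partition boundary.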

Choosing a Vector Database

The vector database landscape has exploded. Purpose-built options like Pinecone, Weaviate, Qdrant, and Milvus are available as managed services with built-in indexing, filtering, and multi-tenancy; Weaviate, Qdrant, and Milvus are also open source and can be self-hosted. They are easy to get started with and are generally the strongest performers for large-scale workloads.

Alternatively, existing databases have added vector capabilities. PostgreSQL with the pgvector extension supports HNSW and IVFFlat indexing. This is a compelling option if your application already uses PostgreSQL: you can store your documents, metadata, and vectors in the same database, simplifying your infrastructure. The tradeoff is that pgvector is generally not as fast as purpose-built solutions on large vector collections, but for datasets under a few million vectors, the performance is usually more than adequate.

For multi-tenant SaaS applications, metadata filtering is critical. You need to be able to search within a single tenant's vectors without scanning the entire index. Look for databases that support pre-filtering (applying metadata constraints before or during the vector search) rather than only post-filtering (searching all vectors and then removing non-matching results). Post-filtering wastes work on other tenants' vectors and, more importantly, can return fewer results than requested, or none at all, when the tenant's documents make up a small share of the index.
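The difference between the two filtering strategies shows up even on a six-record toy index. The records, tenant names, and function names below are made up for illustration, and the exhaustive sort stands in for a real index.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical index contents: most of the vectors near the query belong to "globex".
records = [
    {"id": 1, "tenant": "globex", "vec": [0.1, 0.0]},
    {"id": 2, "tenant": "globex", "vec": [0.0, 0.2]},
    {"id": 3, "tenant": "acme",   "vec": [0.3, 0.0]},
    {"id": 4, "tenant": "globex", "vec": [0.0, 0.4]},
    {"id": 5, "tenant": "acme",   "vec": [0.5, 0.0]},
    {"id": 6, "tenant": "acme",   "vec": [0.0, 0.6]},
]

def post_filter_search(query, records, tenant, k):
    """Take the global top k first, then drop other tenants' records."""
    top = sorted(records, key=lambda r: dist(query, r["vec"]))[:k]
    return [r["id"] for r in top if r["tenant"] == tenant]

def pre_filter_search(query, records, tenant, k):
    """Restrict to the tenant's records first, then take the top k among them."""
    allowed = [r for r in records if r["tenant"] == tenant]
    top = sorted(allowed, key=lambda r: dist(query, r["vec"]))[:k]
    return [r["id"] for r in top]

query = [0.0, 0.0]
print(post_filter_search(query, records, "acme", k=2))  # []
print(pre_filter_search(query, records, "acme", k=2))   # [3, 5]
```

Post-filtering comes back empty because both of the global top 2 belong to another tenant. Systems that only post-filter usually compensate by over-fetching, which costs latency and still gives no guarantee of k results.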

Optimizing RAG Retrieval Quality

Search accuracy is not just about the database. It depends on the entire retrieval pipeline. Hybrid search, which combines vector similarity with traditional keyword matching (BM25), often outperforms either approach alone. The vector search catches semantically related content that keyword search misses, while keyword search catches exact matches, such as product names, error codes, and identifiers, that vector search might rank lower.
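Combining the two result lists requires a fusion step, which the description above leaves open. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption about how the merge might be done. It uses only rank positions, so the vector and BM25 scores never need to be put on a common scale. The document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # placeholder output of the vector search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # placeholder output of the BM25 search

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both searches rise to the top, while documents found by only one still make the list. The constant `k=60` is the conventional default for RRF and damps the influence of any single top-ranked hit.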

Re-ranking is another technique that improves quality. After retrieving the top 20-50 results from the vector database, a cross-encoder model re-scores each result against the original query. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together, but they are too slow for the initial retrieval pass. This two-stage approach gives you the speed of vector search with the accuracy of cross-encoding.
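The shape of the two-stage pipeline can be sketched as follows. Both scoring functions are deliberately simple, hypothetical stand-ins: `overlap_retrieve` plays the role of the fast vector search and `toy_pair_score` plays the role of the cross-encoder. Neither resembles a real model; the point is only the control flow, where the expensive scorer sees just the shortlist.

```python
def two_stage_search(query, corpus, retrieve, rerank_score, candidates=20, top_k=3):
    """Stage 1: cheap retrieval of a shortlist. Stage 2: re-score only the shortlist."""
    shortlist = retrieve(query, corpus, candidates)
    rescored = sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)
    return rescored[:top_k]

def overlap_retrieve(query, corpus, n):
    """Stand-in for vector search: rank documents by number of shared words."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:n]

def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: scores the query and document as a pair."""
    q = set(query.lower().split())
    words = doc.lower().split()
    return sum(1 for w in words if w in q) / len(words)  # share of doc words in query

corpus = [
    "docker container networking explained in depth for beginners",
    "restart a docker container",
    "italian cooking basics",
]
best = two_stage_search("restart docker container", corpus,
                        overlap_retrieve, toy_pair_score, candidates=2, top_k=1)
print(best)  # ['restart a docker container']
```

In a real pipeline the `candidates` value corresponds to the 20-50 results mentioned above: large enough that the right document is usually in the shortlist, small enough that the slower model stays within the latency budget.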

Vector databases are infrastructure, not magic. The effort you invest in embedding model selection, chunking strategy, index configuration, and retrieval pipeline design determines whether your RAG system produces useful answers or irrelevant noise. Get the fundamentals right, and the database will do the heavy lifting.