Tuning Semantic Search for Real Performance Gains

Optimizing search latency in semantic search has become part of my daily work. From everything I've run into, I've learned that a few key factors really determine how fast your semantic search can be. Most performance issues don't come from exotic optimizations, but from a small set of parameters and design choices that shape everything else downstream.

1. Choose the Right Index: HNSW vs IVF

Let's start with the index. For most small to mid-scale workloads, HNSW is the default choice.

It performs well when the dataset is within a range where graph-based search remains memory-efficient and fast, typically up to a few million vectors depending on hardware. In this regime, it offers strong latency, recall, and operational simplicity without needing clustering or training steps.

IVF becomes more relevant as scale increases, often beyond roughly 10 million vectors, or when memory usage becomes a limiting factor. It improves efficiency by grouping vectors into clusters and restricting search to a subset of them, but introduces additional tuning complexity.

If there is uncertainty, HNSW is usually the safer starting point and remains sufficient for most applications before scale becomes a constraint.
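
To make the distinction concrete, here is a minimal sketch using FAISS (assuming faiss-cpu and NumPy are installed); the dataset size and parameter values are illustrative, not recommendations:

```python
import faiss
import numpy as np

d = 384                                   # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")  # stand-in corpus vectors

# HNSW: graph-based, no training step, a good default at small/mid scale
hnsw = faiss.IndexHNSWFlat(d, 32)         # 32 = M, neighbors per node
hnsw.add(xb)

# IVF: cluster-based, needs a training pass; search only visits nprobe clusters
nlist = 1024                              # number of clusters
ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(d), d, nlist)
ivf.train(xb)                             # clustering step HNSW doesn't need
ivf.add(xb)
ivf.nprobe = 16                           # clusters scanned per query
```

The extra train step and the nprobe knob are exactly the tuning complexity mentioned above: IVF gives you more levers, but you have to pull them.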

2. HNSW Parameters and Why They Dominate Search Performance

Assuming you are using HNSW, the most important tuning parameters are those that affect how the graph is explored during query time. These directly influence both search speed and result quality, and even small changes can noticeably affect production behavior.

To understand the cost structure, compare against brute force search: O(N × D), where N is the number of vectors and D is the embedding dimension.

efSearch

HNSW avoids scanning all vectors by navigating a graph, which reduces the effective cost to:

search time ∝ efSearch

This makes efSearch the main runtime control knob in HNSW, directly affecting both latency and recall.

Typical settings:

  • 20 - 50: very fast queries, lower recall
  • 50 - 150: balanced performance
  • 150 - 300+: higher recall, higher latency
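
A quick way to see where your workload sits in these ranges is to sweep efSearch and measure latency against a brute-force ground truth. A rough sketch with FAISS on synthetic data, so the absolute numbers are only directional:

```python
import time
import faiss
import numpy as np

d = 384
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(100, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)
index.add(xb)

flat = faiss.IndexFlatL2(d)               # brute force as ground truth
flat.add(xb)
_, gt = flat.search(xq, 10)

for ef in (20, 50, 150, 300):
    index.hnsw.efSearch = ef              # runtime knob, no rebuild needed
    t0 = time.perf_counter()
    _, ids = index.search(xq, 10)
    ms = (time.perf_counter() - t0) * 1000 / len(xq)
    recall = np.mean([len(set(a) & set(b)) for a, b in zip(ids, gt)]) / 10
    print(f"efSearch={ef:>3}: {ms:.2f} ms/query, recall@10 ~ {recall:.2f}")
```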

topK

topK defines how many results are returned after search. It does not affect traversal, only how many items are selected from the explored candidates.

Its usefulness depends on efSearch. If efSearch is too small, the candidate pool is weak, and increasing topK only returns more low-quality results rather than improving relevance. As a rule of thumb, keep efSearch comfortably above topK.
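
A small sketch of that interaction, again with FAISS and illustrative values:

```python
import faiss
import numpy as np

d = 384
index = faiss.IndexHNSWFlat(d, 32)
index.add(np.random.rand(10_000, d).astype("float32"))

q = np.random.rand(1, d).astype("float32")
index.hnsw.efSearch = 128            # size of the explored candidate pool
dist, ids = index.search(q, 20)      # topK = 20, sliced from that pool

# Raising topK to 100 here would not explore more of the graph; it would
# only return weaker candidates from the same efSearch=128 pool.
```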

M and efConstruction

  • M controls graph connectivity and memory usage
  • efConstruction controls index build quality

These influence how well the graph is structured during indexing. A better graph can improve recall at a given efSearch, but they do not directly affect how many nodes are visited per query, making their impact on latency indirect.
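
Both are set at build time, before vectors are added. A minimal sketch, with values that are illustrative defaults rather than tuned recommendations:

```python
import faiss
import numpy as np

d = 384
M = 32                                   # graph degree: connectivity vs memory
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200          # candidate pool used while building
index.add(np.random.rand(100_000, d).astype("float32"))

# A better-built graph can hit the same recall at a lower efSearch,
# which is how these build-time settings pay off at query time.
```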

3. Embedding Dimension and Compute Cost

Embedding dimension directly affects search speed because similarity computation scales linearly with vector size.

search time ∝ embedding dimension

Since efSearch determines how many comparisons are performed, total query time scales with both factors: roughly efSearch × D per query.

Common real-world examples:

  • all-MiniLM-L6-v2: 384 dimensions
  • bge-base / e5-base: 768 dimensions
  • text-embedding-ada-002: 1536 dimensions

Higher dimensions increase per-comparison cost proportionally, making embedding size one of the most important latency drivers in practice.
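
You can see the linear scaling directly by timing raw dot products at the dimensions above; results depend on hardware and BLAS, so treat them as directional:

```python
import time
import numpy as np

n = 100_000                              # candidates compared per query
for d in (384, 768, 1536):
    xb = np.random.rand(n, d).astype("float32")
    q = np.random.rand(d).astype("float32")
    t0 = time.perf_counter()
    scores = xb @ q                      # one dot product per candidate
    ms = (time.perf_counter() - t0) * 1000
    print(f"d={d:>4}: {ms:.1f} ms for {n:,} comparisons")
```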

4. Distance Metric Cost

Common similarity metrics include cosine similarity, dot product, and L2 distance.

Cosine similarity is the most common choice for text embeddings. When vectors are normalized, it reduces to a dot product, which is computationally efficient. All three metrics have similar asymptotic cost, since each requires a full pass over the embedding dimensions.

Metric              Speed            Notes
Dot product         Fastest          Simple and highly optimized
Cosine similarity   Very fast        Equivalent to dot product if normalized
L2 distance         Slightly slower  Extra arithmetic per dimension

In practice, performance differences are minor. The main constraint is consistency with how the embedding model was trained.
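
The cosine-to-dot-product equivalence is easy to verify, and it is why many pipelines normalize embeddings once at index time:

```python
import numpy as np

a = np.random.rand(384).astype("float32")
b = np.random.rand(384).astype("float32")

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_n = a / np.linalg.norm(a)              # normalize once, e.g. at indexing time
b_n = b / np.linalg.norm(b)

# After normalization, the plain dot product gives the same score
assert np.isclose(cos, np.dot(a_n, b_n), atol=1e-6)
```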

5. Filtering Strategy

Most production systems combine semantic search with metadata filters such as location, category, language, or business constraints.

Filtering affects both latency and recall depending on where it is applied.

If filtering happens after retrieval, compute is wasted on irrelevant candidates. If filtering is too strict before retrieval, recall suffers.

A balanced approach is to push metadata constraints into the vector search itself where the index supports it, or to retrieve a slightly larger candidate pool and apply filters afterward, as in the sketch below.
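
One hedged sketch of the over-retrieval pattern: the 5x overfetch factor and the metadata store are illustrative stand-ins for whatever your system actually keeps alongside its vectors.

```python
import faiss
import numpy as np

d = 384
index = faiss.IndexHNSWFlat(d, 32)
index.add(np.random.rand(10_000, d).astype("float32"))

# Hypothetical metadata store keyed by vector id
metadata = {i: {"lang": "en" if i % 2 else "de"} for i in range(10_000)}

def filtered_search(q, k, pred, overfetch=5):
    # Pull a larger pool so post-filtering can still fill k slots.
    _, ids = index.search(q.reshape(1, -1), k * overfetch)
    hits = [i for i in ids[0] if i != -1 and pred(metadata[i])]
    return hits[:k]

q = np.random.rand(d).astype("float32")
results = filtered_search(q, 10, lambda m: m["lang"] == "en")
```

If the filter is very selective, even a large overfetch may not fill k slots, which is the point where index-level pre-filtering becomes worth the extra complexity.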