Optimizing search time in semantic search has become a daily part of my work. Along the way I've learned that a few key factors largely determine how fast semantic search can be. Most performance issues don't come from exotic optimizations, but from a small set of parameters and design choices that end up shaping everything else downstream.
1. Choose the Right Index: HNSW vs IVF
Let's start with the index. For most small to mid-scale workloads, HNSW is the default choice.
It performs well when the dataset is within a range where graph-based search remains memory-efficient and fast, typically up to a few million vectors depending on hardware. In this regime, it offers strong latency, recall, and operational simplicity without needing clustering or training steps.
IVF becomes more relevant as scale increases, often beyond roughly 10 million vectors, or when memory usage becomes a limiting factor. It improves efficiency by grouping vectors into clusters and restricting search to a subset of them, but introduces additional tuning complexity.
If there is uncertainty, HNSW is usually the safer starting point and remains sufficient for most applications before scale becomes a constraint.
2. HNSW Parameters and Why They Dominate Search Performance
Assuming you are using HNSW, the most important tuning parameters are those that affect how the graph is explored during query time. These directly influence both search speed and result quality, and even small changes can noticeably affect production behavior.
To understand the cost structure, compare against brute force search: O(N × D), where N is the number of vectors and D is embedding dimension.
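The brute-force baseline is just one full pass over the data, which a few lines of numpy make explicit (random vectors, for illustration only):

```python
import numpy as np

N, D = 100_000, 384
xb = np.random.rand(N, D).astype("float32")  # N database vectors
q = np.random.rand(D).astype("float32")      # one query

# One pass over all N vectors, D multiply-adds each: O(N * D)
scores = xb @ q                              # dot-product similarity
top10 = np.argsort(-scores)[:10]             # indices of the 10 best matches
```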
efSearch
HNSW avoids scanning all vectors by navigating a graph, which reduces the effective cost to:
search time ∝ efSearch
This makes efSearch the main runtime control knob in HNSW, directly affecting both latency and recall.
Typical settings:
- 20-50: very fast queries, lower recall
- 50-150: balanced performance
- 150-300+: higher recall, higher latency
topK
topK defines how many results are returned after search. It does not affect traversal, only how many items are selected from the explored candidates.
Its usefulness depends on efSearch. If efSearch is too small, the candidate pool is weak, and increasing topK only returns more low-quality results rather than improving relevance.
M and efConstruction
- M controls graph connectivity and memory usage
- efConstruction controls index build quality
These influence how well the graph is structured during indexing. A better graph can improve recall at a given efSearch, but they do not directly affect how many nodes are visited per query, making their impact on latency indirect.
3. Embedding Dimension and Compute Cost
Embedding dimension directly affects search speed because similarity computation scales linearly with vector size.
search time ∝ embedding dimension
Since efSearch determines how many comparisons are performed, total query time scales with both factors.
Common real-world examples:
- all-MiniLM-L6-v2: 384 dimensions
- bge-base / e5-base: 768 dimensions
- text-embedding-ada-002: 1536 dimensions
Higher dimensions increase per-comparison cost proportionally, making embedding size one of the most important latency drivers in practice.
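The combined effect of efSearch and dimension is easy to estimate with back-of-envelope arithmetic (the efSearch value of 100 below is an assumed example, not a recommendation):

```python
ef_search = 100                          # assumed comparisons per query

# Per-comparison work is one pass over the vector: d multiplies + d adds.
cost = {d: ef_search * 2 * d for d in (384, 768, 1536)}

# Moving from a 384-dim model to a 1536-dim model quadruples
# per-query FLOPs at the same efSearch.
ratio = cost[1536] / cost[384]           # 4.0
```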
4. Distance Metric Cost
Common similarity metrics include cosine similarity, dot product, and L2 distance.
Cosine similarity is most commonly used for text embeddings. When vectors are normalized, it reduces to a dot product, which is computationally efficient. All three methods have similar asymptotic cost since they require a full pass over the embedding dimensions.
| Metric | Speed | Notes |
|---|---|---|
| Dot product | Fastest | Simple and highly optimized |
| Cosine similarity | Very fast | Equivalent to dot product if normalized |
| L2 distance | Slightly slower | Extra arithmetic per dimension |
In practice, performance differences are minor. The main constraint is consistency with how the embedding model was trained.
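The cosine-to-dot-product reduction is worth verifying once, since it is why many systems simply L2-normalize vectors at index time. A minimal numpy check on random vectors:

```python
import numpy as np

a = np.random.rand(384).astype("float32")
b = np.random.rand(384).astype("float32")

# Cosine similarity as usually defined
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2-normalizing once (e.g. at index time), it is a plain dot product
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
dot = np.dot(a_n, b_n)

assert np.isclose(cosine, dot, atol=1e-5)
```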
5. Filtering Strategy
Most production systems combine semantic search with metadata filters such as location, category, language, or business constraints.
Filtering affects both latency and recall depending on where it is applied.
If filtering happens after retrieval, compute is wasted on irrelevant candidates. If filtering is too strict before retrieval, recall suffers.
A balanced approach is to combine metadata constraints with vector search or retrieve a slightly larger candidate pool before applying filters.