Google recently released its new Gemini Embedding 2, a natively multimodal embedding model that maps text, images, video, audio, and even PDF files into a single embedding space. This means you can, for example, run semantic search over images without first having to generate text descriptions for them.
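I haven't tried the multimodal side yet, but assuming the model is exposed through the google-genai SDK and accepts image Parts the way `generate_content` does, an image search index might look roughly like this. Both the model ID and the image-input handling are assumptions on my part, so treat this as a sketch, not gospel:

```python
from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment
MODEL = "gemini-embedding-2"  # placeholder ID; check the official docs

def embed_image(path: str) -> list[float]:
    # Assumption: embed_content accepts image Parts like generate_content does.
    with open(path, "rb") as f:
        part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
    return client.models.embed_content(model=MODEL, contents=part).embeddings[0].values

def embed_text(query: str) -> list[float]:
    return client.models.embed_content(model=MODEL, contents=query).embeddings[0].values

# Because images and text land in the same space, a plain text query can
# rank images by cosine similarity — no caption generation step needed.
```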
That’s very exciting. However, I’ve been using BGE-M3 embeddings for a personal project because of their excellent multilingual support. I’m curious how well Gemini Embedding 2 handles text compared to BGE-M3.
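Before the numbers, a quick note on how differently the two are called. Here's a minimal sketch of getting one dense vector from each; the Gemini model ID is again my placeholder guess, while BGE-M3 runs locally via the FlagEmbedding library:

```python
import numpy as np
from google import genai
from google.genai import types
from FlagEmbedding import BGEM3FlagModel

text = "Berlin is the capital of Germany."

# Gemini: hosted API (GEMINI_API_KEY must be set). output_dimensionality
# picks one of the flexible vector sizes.
client = genai.Client()
gemini_vec = np.array(
    client.models.embed_content(
        model="gemini-embedding-2",  # placeholder model ID
        contents=text,
        config=types.EmbedContentConfig(output_dimensionality=1024),
    ).embeddings[0].values
)

# BGE-M3: runs locally; weights download from Hugging Face on first use,
# and the dense vector is always 1,024-dimensional.
bge = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
bge_vec = bge.encode([text])["dense_vecs"][0]

print(gemini_vec.shape, bge_vec.shape)  # (1024,) and (1024,)
```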
I’ve done some research, and here are the results:
| Category | Gemini Embedding 2 | BGE-M3 |
|---|---|---|
| Primary focus | General-purpose embeddings | Retrieval-optimized embeddings |
| Best use cases | Clustering, classification, semantic similarity, recommendations | Search, RAG, document retrieval |
| Vector dimensions | Flexible (128–3,072) | Fixed (1,024) |
| Max tokens | 8,192 | 8,192 |
| Multilingual support | 100+ languages | 100+ languages |
| Embedding type | Dense only | Dense + sparse + multi-vector (ColBERT-style; see sketch below) |
| Retrieval quality | Good, but generic | Excellent (SOTA-level for IR tasks) |
| Semantic understanding | Excellent (broad + deep) | Very good (but biased toward retrieval) |
| Keyword sensitivity | Weak–moderate | Strong (captures exact terms better) |
| Handling long documents | Good | Excellent |
| Query-to-document matching | Good | Excellent (purpose-built) |
| Ranking precision (top-k) | Moderate | Excellent |
| General benchmark (MMTEB) | Top-tier | Slightly lower but competitive |
| Retrieval benchmarks (MIRACL, etc.) | Good | State-of-the-art |
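To make the "Dense + sparse + multi-vector" row concrete: BGE-M3 can return all three representations from a single `encode()` call. Here's a minimal sketch scoring one query/document pair with each signal; the blend weights at the end are purely illustrative, not tuned values:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

query, doc = "What is BGE-M3?", "BGE-M3 is a multilingual embedding model from BAAI."
out = model.encode(
    [query, doc],
    return_dense=True,        # one 1,024-dim vector per text
    return_sparse=True,       # per-token lexical weights (the keyword signal)
    return_colbert_vecs=True, # one vector per token (multi-vector, ColBERT-style)
)

# Dense vectors come back L2-normalized, so the dot product is cosine similarity.
dense = float(out["dense_vecs"][0] @ out["dense_vecs"][1])

# Sparse score: overlap of weighted lexical terms, similar in spirit to BM25.
sparse = model.compute_lexical_matching_score(
    out["lexical_weights"][0], out["lexical_weights"][1]
)

# Multi-vector score: fine-grained token-to-token (late) interaction.
colbert = float(model.colbert_score(out["colbert_vecs"][0], out["colbert_vecs"][1]))

# Blending the signals is a common trick; these weights are just an example.
hybrid = 0.4 * dense + 0.2 * sparse + 0.4 * colbert
print(f"dense={dense:.3f} sparse={sparse:.3f} colbert={colbert:.3f} hybrid={hybrid:.3f}")
```

That combination of exact-term matching and token-level interaction is a big part of why BGE-M3 wins the retrieval-focused rows in the table above.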