What Are Embeddings?
Embeddings are dense numerical representations of data (text, images, audio) that capture semantic meaning in a continuous vector space. Unlike sparse representations like bag-of-words or TF-IDF, embeddings encode meaning such that similar concepts are mapped to nearby points in the vector space.
Modern text embeddings typically have a few hundred to a few thousand dimensions (common sizes include 384, 768, 1536, and 3072). Each dimension captures some aspect of meaning, though individual dimensions aren't directly interpretable. The key property is that the distance between vectors corresponds to semantic similarity—documents about similar topics will have embeddings that are close together.
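To make this concrete, the sketch below embeds three sentences with the open-source sentence-transformers library and compares them. The model name (all-MiniLM-L6-v2) is just an illustrative choice; any of the models discussed later would behave the same way.

```python
# Minimal sketch: embed two related and one unrelated sentence, then compare
# similarities. Assumes the sentence-transformers package; the model name is
# an illustrative choice (384-dimensional output).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is sunny today.",
]
# normalize_embeddings=True returns unit-length vectors, so a dot product
# equals cosine similarity.
vectors = model.encode(sentences, normalize_embeddings=True)

print(np.dot(vectors[0], vectors[1]))  # related sentences -> higher score
print(np.dot(vectors[0], vectors[2]))  # unrelated sentences -> lower score
```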
Embedding Models
Proprietary Models
Leading embedding models from major AI providers include:
- OpenAI text-embedding-3-large: 3072 dimensions, excellent general-purpose performance
- OpenAI text-embedding-3-small: 1536 dimensions, good balance of quality and cost
- Cohere embed-v3: Strong multilingual support, good for diverse content
- Google Vertex AI Embeddings: Tight integration with Google Cloud
- Voyage AI: Specialized models for code, legal, and other domains
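Using a hosted model is typically a single API call per batch of texts. The sketch below assumes the OpenAI Python SDK (v1-style client) with an OPENAI_API_KEY in the environment; other providers follow a broadly similar request/response pattern.

```python
# Hedged sketch: embedding a small batch of texts with a hosted model via the
# OpenAI Python SDK (v1-style client). Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
texts = ["first document", "second document"]

resp = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input=texts,
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 floats each
```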
Open-Source Models
High-quality open-source alternatives include:
- BGE (BAAI General Embedding): Excellent performance, multiple sizes available
- E5: Microsoft's embedding model with strong benchmark results
- GTE (General Text Embeddings): Alibaba's competitive embedding model
- Instructor: Task-specific embeddings with instruction prefixes
- Nomic Embed: Fully open training data and weights
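Open-source models are most often run through the sentence-transformers library. Some families (E5, Instructor, and BGE retrieval variants) recommend a task prefix or instruction on queries; the exact wording is model-specific, so treat the prefix below as an assumption to verify against the model card.

```python
# Sketch: running an open-source embedding model locally via
# sentence-transformers. The model name and query prefix are illustrative;
# check the model card for the recommended usage.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Some models expect a prefix or instruction on queries, while passages
# are embedded as-is.
query = "Represent this sentence for searching relevant passages: how to reset a password"
passages = [
    "Click 'Forgot password' on the login page.",
    "Our office is closed on Sundays.",
]

q_vec = model.encode(query, normalize_embeddings=True)
p_vecs = model.encode(passages, normalize_embeddings=True)
print(p_vecs @ q_vec)  # similarity of each passage to the query
```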
Choosing an Embedding Model
Key factors when selecting an embedding model:
- Domain fit: Some models perform better on specific domains (code, legal, medical)
- Dimension size: Higher dimensions capture more nuance but require more storage
- Max sequence length: How much text can be embedded at once
- Latency and cost: API costs for proprietary models, compute costs for self-hosted
- Multilingual support: Essential for non-English or mixed-language content
Vector Databases
Vector databases are specialized systems for storing and searching high-dimensional embeddings. They use approximate nearest neighbor (ANN) algorithms to find similar vectors efficiently at scale.
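As a self-contained illustration of ANN search (without any managed service), the sketch below builds an HNSW index with the FAISS library; the dimension and graph parameters are illustrative, not tuned recommendations.

```python
# Sketch: approximate nearest neighbor search with FAISS's HNSW index.
# The dimension, connectivity (M), and k values are illustrative.
import numpy as np
import faiss

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # 5 approximate nearest neighbors
print(ids[0], distances[0])
```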
Purpose-Built Vector Databases
- Pinecone: Fully managed, easy to use, excellent for getting started
- Weaviate: Open-source, supports hybrid search natively
- Milvus: Open-source, highly scalable, strong community
- Qdrant: Open-source, Rust-based, good performance
- Chroma: Lightweight, developer-friendly, good for prototyping
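As a taste of the developer experience, here is a hedged quickstart with Chroma using an in-memory client and its default embedding function; other databases expose broadly similar add/query operations through their own clients, and exact API details may vary by version.

```python
# Hedged quickstart: in-memory Chroma collection using its default
# embedding function. API details may differ across chromadb versions.
import chromadb

client = chromadb.Client()  # ephemeral, in-memory instance
collection = client.create_collection(name="docs")

collection.add(
    ids=["a", "b", "c"],
    documents=[
        "Reset your password from the login page.",
        "Invoices are emailed at the start of each month.",
        "The support team is available on weekdays.",
    ],
)

results = collection.query(query_texts=["how do I change my password"], n_results=2)
print(results["ids"], results["distances"])
```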
Vector Extensions for Traditional Databases
- pgvector (PostgreSQL): Add vector search to existing Postgres deployments
- Elasticsearch kNN: Vector search in Elasticsearch clusters
- Redis Vector Similarity: In-memory vector search
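A hedged sketch of the pgvector route, assuming the psycopg and pgvector Python packages and an existing Postgres instance; the DSN, table, and tiny 3-dimensional vectors are purely illustrative.

```python
# Hedged sketch: vector similarity search in Postgres via the pgvector
# extension. DSN, table name, and dimensions are examples only.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app user=app", autocommit=True)  # example DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg send/receive vector values as arrays

conn.execute(
    "CREATE TABLE IF NOT EXISTS items "
    "(id bigserial PRIMARY KEY, content text, embedding vector(3))"
)  # vector(3) for brevity; real embeddings use the model's dimension

conn.execute(
    "INSERT INTO items (content, embedding) VALUES (%s, %s)",
    ("example document", np.array([0.1, 0.2, 0.3])),
)

# <=> is pgvector's cosine distance operator (<-> is L2 distance).
rows = conn.execute(
    "SELECT content, embedding <=> %s AS distance FROM items ORDER BY distance LIMIT 5",
    (np.array([0.1, 0.2, 0.25]),),
).fetchall()
print(rows)
```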
Similarity Metrics
Different metrics measure vector similarity in different ways:
- Cosine Similarity: Measures angle between vectors, ignoring magnitude. Most common for text embeddings.
- Euclidean Distance (L2): Straight-line distance between points. Sensitive to vector magnitude.
- Dot Product: Fast computation, equivalent to cosine for normalized vectors.
- Manhattan Distance (L1): Sum of absolute differences, sometimes used for sparse vectors.
For most text embedding applications, cosine similarity is the standard choice because it normalizes for vector length and focuses on directional similarity.
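For reference, here are the four metrics written out with NumPy for a pair of toy vectors.

```python
# The four metrics above, computed with NumPy on two example vectors.
import numpy as np

a = np.array([0.2, 0.1, 0.7])
b = np.array([0.3, 0.0, 0.6])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)   # L2 distance
dot = np.dot(a, b)                  # equals cosine when a and b are unit length
manhattan = np.sum(np.abs(a - b))   # L1 distance

print(cosine, euclidean, dot, manhattan)
```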
Implementation Best Practices
Chunking Strategies
Long documents must be split into chunks before embedding (a minimal chunker is sketched after this list). Effective chunking strategies include:
- Fixed-size chunks: Simple but may split semantic units
- Semantic chunking: Split at paragraph or section boundaries
- Overlapping windows: Include context from adjacent chunks
- Recursive chunking: Start large, split only when needed
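A minimal sketch combining two of the strategies above (fixed-size chunks with overlapping windows); the character-based sizes are illustrative defaults, not recommendations.

```python
# Minimal sketch of fixed-size chunking with overlapping windows; the
# chunk size and overlap (in characters) are illustrative values.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of roughly chunk_size characters, each sharing
    `overlap` characters with the previous chunk to preserve context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "..." * 2000  # placeholder long document
for chunk in chunk_text(document)[:3]:
    print(len(chunk))
```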
Metadata Filtering
Combine vector search with metadata filters for more precise results. Pre-filter by date, category, source, or other attributes before vector similarity ranking.
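A self-contained sketch of the idea using plain NumPy; in practice the filter is pushed down into the vector database's query API rather than applied in application code.

```python
# Sketch: pre-filter on metadata, then rank only the survivors by cosine
# similarity. Records and the query vector are toy data.
import numpy as np

records = [
    {"id": 1, "category": "billing", "vec": np.array([0.9, 0.1, 0.0])},
    {"id": 2, "category": "support", "vec": np.array([0.8, 0.2, 0.1])},
    {"id": 3, "category": "billing", "vec": np.array([0.1, 0.9, 0.3])},
]
query = np.array([0.85, 0.15, 0.05])

# 1. Pre-filter on metadata (here: category) to shrink the candidate set.
candidates = [r for r in records if r["category"] == "billing"]

# 2. Rank the remaining candidates by cosine similarity.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(candidates, key=lambda r: cosine(query, r["vec"]), reverse=True)
print([r["id"] for r in ranked])  # [1, 3]
```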
Re-ranking
Use a two-stage retrieval pipeline: fast vector search to get candidates, then a more accurate re-ranker (like a cross-encoder) to refine the final ranking.
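A sketch of the second stage, assuming candidates have already been retrieved by vector search and using a cross-encoder from sentence-transformers; the model name is an illustrative choice.

```python
# Sketch of two-stage retrieval: vector search supplies candidates (assumed
# done already), then a cross-encoder re-scores each (query, passage) pair.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

query = "how do I reset my password"
candidates = [
    "Click 'Forgot password' on the login page to reset it.",
    "Our pricing plans start at $10 per month.",
    "Password changes take effect immediately.",
]  # would normally come from the first-stage vector search

scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # highest-scoring passage after re-ranking
```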
Vector search excels at finding semantically similar content but should typically be combined with other signals (recency, popularity, exact match) for production systems.