What Are Embeddings?

Embeddings are dense numerical representations of data (text, images, audio) that capture semantic meaning in a continuous vector space. Unlike sparse representations such as bag-of-words or TF-IDF, embeddings encode meaning so that similar concepts map to nearby points in the vector space.

Modern text embeddings typically have between a few hundred and a few thousand dimensions (commonly 384 to 4096). Each dimension captures some aspect of meaning, though individual dimensions aren't directly interpretable. The key property is that the distance between vectors corresponds to semantic similarity—documents about similar topics will have embeddings that are close together.
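
This nearness property can be illustrated with a minimal sketch. The vectors below are toy 4-dimensional stand-ins (real models produce hundreds or thousands of dimensions), but the comparison works the same way:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — hypothetical values chosen so that related concepts
# point in similar directions.
dog = [0.8, 0.3, 0.1, 0.0]
puppy = [0.7, 0.4, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9, 0.8]

# Related concepts score higher than unrelated ones.
assert cosine_similarity(dog, puppy) > cosine_similarity(dog, invoice)
```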

Embedding Models

Proprietary Models

Leading embedding models from major AI providers include:

  • OpenAI text-embedding-3-large: 3072 dimensions, excellent general-purpose performance
  • OpenAI text-embedding-3-small: 1536 dimensions, good balance of quality and cost
  • Cohere embed-v3: Strong multilingual support, good for diverse content
  • Google Vertex AI Embeddings: Tight integration with Google Cloud
  • Voyage AI: Specialized models for code, legal, and other domains

Open-Source Models

High-quality open-source alternatives include:

  • BGE (BAAI General Embedding): Excellent performance, multiple sizes available
  • E5: Microsoft's embedding model with strong benchmark results
  • GTE (General Text Embeddings): Alibaba's competitive embedding model
  • Instructor: Task-specific embeddings with instruction prefixes
  • Nomic Embed: Fully open training data and weights

Choosing an Embedding Model

Key factors when selecting an embedding model:

  • Domain fit: Some models perform better on specific domains (code, legal, medical)
  • Dimension size: Higher dimensions capture more nuance but require more storage
  • Max sequence length: How much text can be embedded at once
  • Latency and cost: API costs for proprietary models, compute costs for self-hosted
  • Multilingual support: Essential for non-English or mixed-language content

Vector Databases

Vector databases are specialized systems for storing and searching high-dimensional embeddings. They use approximate nearest neighbor (ANN) algorithms to find similar vectors efficiently at scale.
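
What ANN indexes approximate is an exhaustive nearest-neighbor scan. The sketch below shows that exact baseline over a tiny hypothetical index; real vector databases replace the linear scan with index structures such as HNSW or IVF to avoid touching every vector:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, vectors, k=2):
    """Exact nearest-neighbor search: score every vector, sort, take the best k.
    ANN indexes trade a little accuracy to skip most of this work at scale."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional index.
index = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.2],
    "doc3": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```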

Purpose-Built Vector Databases

  • Pinecone: Fully managed, easy to use, excellent for getting started
  • Weaviate: Open-source, supports hybrid search natively
  • Milvus: Open-source, highly scalable, strong community
  • Qdrant: Open-source, Rust-based, good performance
  • Chroma: Lightweight, developer-friendly, good for prototyping

Vector Extensions for Traditional Databases

  • pgvector (PostgreSQL): Add vector search to existing Postgres deployments
  • Elasticsearch kNN: Vector search in Elasticsearch clusters
  • Redis Vector Similarity: In-memory vector search

Similarity Metrics

Different metrics measure vector similarity in different ways:

  • Cosine Similarity: Measures angle between vectors, ignoring magnitude. Most common for text embeddings.
  • Euclidean Distance (L2): Straight-line distance between points. Sensitive to vector magnitude.
  • Dot Product: Fast computation, equivalent to cosine for normalized vectors.
  • Manhattan Distance (L1): Sum of absolute differences, sometimes used for sparse vectors.

For most text embedding applications, cosine similarity is the standard choice because it normalizes for vector length and focuses on directional similarity.
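
The four metrics, and the equivalence between dot product and cosine for unit-length vectors, can be sketched in a few lines:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]

# For unit-length vectors, dot product equals cosine similarity,
# which is why normalized embeddings can use the cheaper dot product.
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-9
```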

Implementation Best Practices

Chunking Strategies

Long documents must be split into chunks before embedding. Effective chunking strategies include:

  • Fixed-size chunks: Simple but may split semantic units
  • Semantic chunking: Split at paragraph or section boundaries
  • Overlapping windows: Include context from adjacent chunks
  • Recursive chunking: Start large, split only when needed
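
A minimal sketch of the first two ideas combined—fixed-size chunks with overlapping windows—measured in words for simplicity (production systems usually count tokens and respect semantic boundaries):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` words of its predecessor so context isn't lost at the seams."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Hypothetical 500-word document.
doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=50)
# Adjacent chunks share a 50-word overlap.
```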

Metadata Filtering

Combine vector search with metadata filters for more precise results. Pre-filter by date, category, source, or other attributes before vector similarity ranking.
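
A sketch of this pattern, using hypothetical records that carry both an embedding and filterable metadata: the metadata filter runs first, and only the survivors are ranked by similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical corpus: each record has an embedding plus metadata fields.
docs = [
    {"id": "a", "category": "legal", "year": 2024, "vec": [0.9, 0.1]},
    {"id": "b", "category": "legal", "year": 2020, "vec": [0.8, 0.3]},
    {"id": "c", "category": "news",  "year": 2024, "vec": [0.95, 0.05]},
]

def filtered_search(query_vec, docs, category, min_year):
    # Pre-filter on metadata, then rank only the survivors by similarity.
    candidates = [d for d in docs if d["category"] == category and d["year"] >= min_year]
    return sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)

results = filtered_search([1.0, 0.0], docs, category="legal", min_year=2023)
```

Note that "c" is the closest vector overall but is excluded by the category filter—exactly the behavior metadata filtering is for.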

Re-ranking

Use a two-stage retrieval pipeline: fast vector search to get candidates, then a more accurate re-ranker (like a cross-encoder) to refine the final ranking.
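
The two stages can be sketched as follows. The `rerank_score` function here is a deliberately crude word-overlap stand-in for a real cross-encoder, which would jointly encode each (query, document) pair:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def rerank_score(query, doc_text):
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    q, d = set(query.lower().split()), set(doc_text.lower().split())
    return len(q & d) / max(len(q), 1)

def two_stage_search(query, query_vec, corpus, k_candidates=2, k_final=1):
    # Stage 1: cheap vector search to shortlist candidates.
    candidates = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    candidates = candidates[:k_candidates]
    # Stage 2: slower, more accurate scoring over the shortlist only.
    candidates.sort(key=lambda d: rerank_score(query, d["text"]), reverse=True)
    return candidates[:k_final]

# Hypothetical corpus with toy 2-dimensional embeddings.
corpus = [
    {"text": "how to reset a password", "vec": [0.9, 0.1]},
    {"text": "password reset instructions for admins", "vec": [0.85, 0.15]},
    {"text": "company holiday schedule", "vec": [0.1, 0.9]},
]
results = two_stage_search("how to reset my password", [1.0, 0.0], corpus)
```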

Vector search excels at finding semantically similar content but should typically be combined with other signals (recency, popularity, exact match) for production systems.

Common Questions

Vector Search FAQ

How many dimensions should my embeddings have?

Higher dimensions capture more semantic nuance but increase storage and search costs. For most applications, 768-1536 dimensions provide excellent quality. Use larger dimensions (3072+) when you need maximum precision and have the infrastructure to support it. Some embedding providers offer dimension reduction options that let you trade quality for efficiency.
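
One common form of dimension reduction is truncating a vector to its leading components and re-normalizing, which is roughly how "shortened" embeddings from some providers behave; note this only preserves quality for models trained to support it (e.g. via Matryoshka-style training), which is an assumption you should verify for your model:

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components, then rescale to unit length so
    cosine and dot-product comparisons remain meaningful."""
    short = vec[:dims]
    norm = math.sqrt(sum(x * x for x in short))
    return [x / norm for x in short]

# Hypothetical 6-dimensional embedding reduced to 4 dimensions.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]
short = truncate_and_normalize(full, 4)
```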

What chunk size should I use?

Chunk size depends on your use case. Smaller chunks (100-200 tokens) are better for precise question answering. Larger chunks (500-1000 tokens) preserve more context for summarization or complex reasoning. Start with 200-400 tokens with some overlap and adjust based on retrieval quality. Consider semantic boundaries rather than arbitrary character counts.

Do I need to normalize my embeddings?

Many embedding models already output normalized vectors. If using cosine similarity, normalization ensures consistent results. For dot product search with normalized vectors, results are equivalent to cosine similarity. Check your embedding model's documentation—most modern models handle normalization internally.