Embeddings Module¶
Directory: src/selectools/embeddings/
Files: provider.py, openai.py, anthropic.py, gemini.py, cohere.py
Table of Contents¶
- Overview
- Embedding Provider Protocol
- Provider Implementations
- Usage Patterns
- Cost Comparison
- Implementation Details
Overview¶
Embedding Providers convert text into dense vector representations (embeddings) that capture semantic meaning. These vectors enable:
- Semantic search (finding similar content)
- Clustering (grouping related content)
- Classification (categorizing content)
- Recommendation (suggesting similar items)
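All of these reduce to the same core operation: comparing vectors by similarity. A toy sketch with hand-made 3-d vectors (real embeddings have hundreds of dimensions, but the mechanics are identical): rank documents by cosine similarity to a query vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend embeddings: each document is a tiny 3-d vector.
doc_vectors = {
    "reset your password": [0.9, 0.1, 0.0],
    "pasta recipes":       [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # e.g., "forgot login credentials"

# Semantic search = pick the document whose vector is closest to the query's.
best = max(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]))
print(best)  # "reset your password"
```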
Why Embeddings?¶
Traditional keyword search fails when two texts mean the same thing but share no words (e.g., "How do I reset my password?" vs. "I forgot my login credentials"):
- Keyword match: 0% overlap
- Semantic similarity: 85% similar
Embeddings capture meaning, not just words.
Embedding Provider Protocol¶
Interface¶
from abc import ABC, abstractmethod
from typing import List

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_text(self, text: str) -> List[float]:
        """Embed a single text string."""
        pass

    @abstractmethod
    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        """Embed multiple texts (batch operation)."""
        pass

    @abstractmethod
    def embed_query(self, query: str) -> List[float]:
        """Embed a query (may differ from document embedding)."""
        pass

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Get embedding vector dimension."""
        pass
Key Methods¶
- `embed_text()`: Single text → embedding
- `embed_texts()`: Multiple texts → embeddings (batched, more efficient)
- `embed_query()`: Query text → embedding (some models differentiate)
- `dimension`: Vector size (e.g., 1536 for OpenAI text-embedding-3-small)
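Because the protocol is small, an offline stand-in is easy to write for unit tests. The stub below is hypothetical (not part of selectools); it satisfies the `EmbeddingProvider` interface by duck typing, producing deterministic hash-based vectors with no API calls:

```python
import hashlib
from typing import List

class FakeEmbeddingProvider:
    """Deterministic test double for an EmbeddingProvider (hypothetical).

    Hashes each text into a fixed-size pseudo-random vector, so the same
    input always yields the same embedding without any network access.
    """

    def __init__(self, dim: int = 8):
        self._dim = dim

    def embed_text(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map digest bytes to floats in [0, 1]; repeat bytes if dim > 32.
        return [digest[i % len(digest)] / 255.0 for i in range(self._dim)]

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_text(t) for t in texts]

    def embed_query(self, query: str) -> List[float]:
        return self.embed_text(query)

    @property
    def dimension(self) -> int:
        return self._dim
```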
Provider Implementations¶
OpenAI Embeddings¶
from selectools.embeddings import OpenAIEmbeddingProvider
from selectools.models import OpenAI

embedder = OpenAIEmbeddingProvider(
    api_key="sk-...",  # Or set OPENAI_API_KEY
    model=OpenAI.Embeddings.TEXT_EMBEDDING_3_SMALL.id,
)

# Embed single text
embedding = embedder.embed_text("Hello world")
print(f"Dimension: {embedder.dimension}")  # 1536

# Embed multiple texts (batched)
embeddings = embedder.embed_texts([
    "First document",
    "Second document",
    "Third document",
])
print(f"Got {len(embeddings)} embeddings")
Available Models:
- `text-embedding-3-small` (1536d, $0.02/1M tokens) - Recommended
- `text-embedding-3-large` (3072d, $0.13/1M tokens) - Higher quality
- `text-embedding-ada-002` (1536d, $0.10/1M tokens) - Legacy
Anthropic/Voyage Embeddings¶
from selectools.embeddings import AnthropicEmbeddingProvider
from selectools.models import Anthropic

embedder = AnthropicEmbeddingProvider(
    api_key="...",  # Or set VOYAGE_API_KEY
    model=Anthropic.Embeddings.VOYAGE_3_LITE.id,
)

embedding = embedder.embed_text("Hello world")
print(f"Dimension: {embedder.dimension}")  # 512
Available Models:
- `voyage-3` (1024d, $0.06/1M tokens) - High quality
- `voyage-3-lite` (512d, $0.02/1M tokens) - Recommended
Note: Uses Voyage AI API (Anthropic's embedding partner)
Gemini Embeddings¶
from selectools.embeddings import GeminiEmbeddingProvider
from selectools.models import Gemini

embedder = GeminiEmbeddingProvider(
    api_key="...",  # Or set GEMINI_API_KEY / GOOGLE_API_KEY
    model=Gemini.Embeddings.EMBEDDING_004.id,
)

embedding = embedder.embed_text("Hello world")
print(f"Dimension: {embedder.dimension}")  # 768
Available Models:
- `text-embedding-004` (768d, FREE) - Recommended for cost
- `text-embedding-001` (768d, FREE) - Legacy
Best Feature: Completely free (with rate limits)
Cohere Embeddings¶
from selectools.embeddings import CohereEmbeddingProvider
from selectools.models import Cohere

embedder = CohereEmbeddingProvider(
    api_key="...",  # Or set COHERE_API_KEY
    model=Cohere.Embeddings.EMBED_V3.id,
)

embedding = embedder.embed_text("Hello world")
print(f"Dimension: {embedder.dimension}")  # 1024
Available Models:
- `embed-english-v3.0` (1024d, $0.10/1M tokens)
- `embed-multilingual-v3.0` (1024d, $0.10/1M tokens) - 100+ languages
- `embed-english-light-v3.0` (384d, $0.10/1M tokens) - Smaller
Usage Patterns¶
With RAG System¶
from selectools.rag import VectorStore, RAGAgent
from selectools.embeddings import OpenAIEmbeddingProvider

# 1. Create embedding provider
embedder = OpenAIEmbeddingProvider()

# 2. Create vector store
store = VectorStore.create("sqlite", embedder=embedder, db_path="docs.db")

# 3. Create RAG agent (`provider` is your LLM provider, configured elsewhere)
agent = RAGAgent.from_directory(
    directory="./docs",
    provider=provider,
    vector_store=store,
)
Manual Embedding¶
# Embed documents
docs = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = embedder.embed_texts(docs)

# Embed query
query = "What is Document 1 about?"
query_embedding = embedder.embed_query(query)

# Compute similarity (cosine similarity)
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

for i, doc_emb in enumerate(doc_embeddings):
    similarity = cosine_similarity(query_embedding, doc_emb)
    print(f"Document {i+1} similarity: {similarity:.3f}")
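When scoring a query against many documents, the per-document loop can be collapsed into a single matrix operation. A sketch using NumPy, assuming the document embeddings are stacked into one 2-d array:

```python
import numpy as np

def rank_by_similarity(query_embedding, doc_embeddings):
    """Return (indices sorted best-first, cosine similarities)."""
    D = np.asarray(doc_embeddings, dtype=np.float64)   # shape (n_docs, dim)
    q = np.asarray(query_embedding, dtype=np.float64)  # shape (dim,)
    # One matrix-vector product computes every dot product at once.
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims), sims

order, sims = rank_by_similarity([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
print(order[0])  # 0 — the first document is most similar
```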
Batch Processing¶
# ✅ Good - Batch processing
texts = ["Text 1", "Text 2", ..., "Text 1000"]
embeddings = embedder.embed_texts(texts) # Single API call (batched)
# ❌ Bad - Individual calls
embeddings = [embedder.embed_text(text) for text in texts] # 1000 API calls!
Document vs Query Embedding¶
Some models optimize differently for documents and queries:
# For indexing documents
doc_embeddings = embedder.embed_texts(documents)
# For searching
query_embedding = embedder.embed_query("search query")
OpenAI uses the same embedding for both internally, but Gemini (via `task_type`) and Cohere (via input type) embed queries and documents differently.
Cost Comparison¶
Per 1M Tokens¶
| Provider | Model | Dimension | Cost | Notes |
|---|---|---|---|---|
| Gemini | text-embedding-004 | 768 | $0.00 | FREE (with limits) |
| OpenAI | text-embedding-3-small | 1536 | $0.02 | Best value |
| Anthropic/Voyage | voyage-3-lite | 512 | $0.02 | Compact |
| Anthropic/Voyage | voyage-3 | 1024 | $0.06 | High quality |
| OpenAI | text-embedding-ada-002 | 1536 | $0.10 | Legacy |
| Cohere | embed-v3 | 1024 | $0.10 | Multilingual |
| OpenAI | text-embedding-3-large | 3072 | $0.13 | Premium |
Cost Example¶
Embedding 1,000 documents (avg 500 tokens each):
Total tokens: 1,000 × 500 = 500,000 tokens
Gemini: $0.00 (FREE)
OpenAI-small: 500k / 1M × $0.02 = $0.01
Voyage-lite: 500k / 1M × $0.02 = $0.01
Voyage-3: 500k / 1M × $0.06 = $0.03
OpenAI-large: 500k / 1M × $0.13 = $0.065
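The arithmetic above is easy to wrap in a helper. A sketch with prices hard-coded from the comparison table (treat them as point-in-time assumptions, not current pricing):

```python
# $ per 1M tokens, from the comparison table above (subject to change)
PRICE_PER_MILLION = {
    "gemini/text-embedding-004": 0.00,
    "openai/text-embedding-3-small": 0.02,
    "voyage/voyage-3-lite": 0.02,
    "voyage/voyage-3": 0.06,
    "openai/text-embedding-3-large": 0.13,
}

def embedding_cost(model: str, total_tokens: int) -> float:
    """Estimated embedding cost in dollars for a token count."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]

# 1,000 documents × 500 tokens each = 500,000 tokens
print(embedding_cost("openai/text-embedding-3-small", 500_000))  # 0.01
print(embedding_cost("openai/text-embedding-3-large", 500_000))  # 0.065
```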
Recommendation¶
- Budget/Hobby Projects: Gemini (free)
- Production (balanced): OpenAI text-embedding-3-small
- Production (premium): OpenAI text-embedding-3-large or Voyage-3
- Multilingual: Cohere embed-multilingual-v3.0
Implementation Details¶
OpenAI Provider¶
import os
from typing import List

class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str | None = None, model: str = "text-embedding-3-small"):
        from openai import OpenAI

        self.api_key = api_key or os.getenv("OPENAI_API_KEY")
        self._client = OpenAI(api_key=self.api_key)
        self.model = model
        self._dimension = self._get_dimension()

    def embed_text(self, text: str) -> List[float]:
        response = self._client.embeddings.create(
            model=self.model,
            input=text,
        )
        return response.data[0].embedding

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        response = self._client.embeddings.create(
            model=self.model,
            input=texts,  # Batch API call
        )
        return [item.embedding for item in response.data]

    def embed_query(self, query: str) -> List[float]:
        return self.embed_text(query)  # Same for OpenAI

    @property
    def dimension(self) -> int:
        return self._dimension

    def _get_dimension(self) -> int:
        # Model-specific dimensions
        if "text-embedding-3-small" in self.model:
            return 1536
        elif "text-embedding-3-large" in self.model:
            return 3072
        elif "ada-002" in self.model:
            return 1536
        return 1536  # default
Gemini Provider¶
import os
from typing import List

import google.generativeai as genai  # Module-level so every method can use it

class GeminiEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str | None = None, model: str = "text-embedding-004"):
        self.api_key = api_key or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
        genai.configure(api_key=self.api_key)
        self.model = model

    def embed_text(self, text: str) -> List[float]:
        result = genai.embed_content(
            model=f"models/{self.model}",
            content=text,
            task_type="retrieval_document",
        )
        return result["embedding"]

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        # Not a true batch call: each text is embedded with its own request
        results = []
        for text in texts:
            results.append(self.embed_text(text))
        return results

    def embed_query(self, query: str) -> List[float]:
        result = genai.embed_content(
            model=f"models/{self.model}",
            content=query,
            task_type="retrieval_query",  # Different task type for queries
        )
        return result["embedding"]

    @property
    def dimension(self) -> int:
        return 768  # Gemini embeddings are 768-dimensional
Best Practices¶
1. Batch When Possible¶
# ✅ Good
embeddings = embedder.embed_texts(all_docs)
# ❌ Bad
embeddings = [embedder.embed_text(doc) for doc in all_docs]
2. Cache Embeddings¶
# Embed once, store in vector DB
store.add_documents(docs) # Embeddings stored
# Reuse for all queries
results = store.search(query_embedding)
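For texts that recur across runs, a thin caching layer avoids re-embedding entirely. A minimal in-memory sketch (hypothetical helper, not part of selectools; any object with an `embed_texts` method works as the inner embedder):

```python
from typing import Dict, List

class CachedEmbedder:
    """Memoizes embeddings by text; delegates cache misses in one batch."""

    def __init__(self, embedder):
        self._embedder = embedder
        self._cache: Dict[str, List[float]] = {}

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        # Only embed texts we have not seen before.
        missing = [t for t in texts if t not in self._cache]
        if missing:
            for text, emb in zip(missing, self._embedder.embed_texts(missing)):
                self._cache[text] = emb
        return [self._cache[t] for t in texts]
```

A persistent variant could back `_cache` with the vector store itself; the in-memory dict is just the simplest illustration.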
3. Choose Appropriate Dimension¶
Higher dimensions ≠ always better:
- More dimensions → more storage, slower search
- Consider 512-768d for most applications
- Use 1536-3072d for high-precision requirements
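The storage side of that trade-off is easy to quantify: at float32 (4 bytes per value), raw index size scales linearly with dimension. A sketch:

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate raw storage for a dense float32 vector index, in GB."""
    return num_vectors * dim * bytes_per_value / 1e9

# 1M documents at different embedding sizes:
print(round(index_size_gb(1_000_000, 768), 2))   # 3.07 GB
print(round(index_size_gb(1_000_000, 1536), 2))  # 6.14 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # 12.29 GB
```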
4. Monitor Costs¶
from selectools.rag import RAGAgent
agent = RAGAgent.from_directory("./docs", provider, store)
# Check embedding costs
print(agent.usage)
# Shows LLM + embedding costs separately
5. Use Free Tier for Development¶
# Development
embedder = GeminiEmbeddingProvider() # FREE
# Production
embedder = OpenAIEmbeddingProvider() # Paid but reliable
Testing¶
def test_embedding_provider():
    embedder = OpenAIEmbeddingProvider()

    # Test single embedding
    embedding = embedder.embed_text("Hello world")
    assert len(embedding) == embedder.dimension
    assert all(isinstance(x, float) for x in embedding)

    # Test batch embedding
    embeddings = embedder.embed_texts(["Text 1", "Text 2"])
    assert len(embeddings) == 2
    assert len(embeddings[0]) == embedder.dimension

    # Test query embedding
    query_emb = embedder.embed_query("search query")
    assert len(query_emb) == embedder.dimension

def test_embeddings_are_different():
    embedder = OpenAIEmbeddingProvider()

    emb1 = embedder.embed_text("Python programming")
    emb2 = embedder.embed_text("JavaScript coding")
    emb3 = embedder.embed_text("Cooking recipes")

    # Similar texts should have similar embeddings
    sim_prog = cosine_similarity(emb1, emb2)
    sim_unrelated = cosine_similarity(emb1, emb3)
    assert sim_prog > sim_unrelated  # Programming more similar than cooking
Troubleshooting¶
Rate Limits¶
# Issue: Hit rate limits with free tier
# Fix: Add retry logic or use paid tier
import time

def embed_with_retry(embedder, texts, max_retries=3):
    for attempt in range(max_retries):
        try:
            return embedder.embed_texts(texts)
        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
High Memory Usage¶
# Issue: Embedding millions of documents at once
# Fix: Batch processing
def embed_in_batches(embedder, texts, batch_size=100):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embedder.embed_texts(batch)
        all_embeddings.extend(embeddings)
    return all_embeddings
Further Reading¶
- RAG Module - Using embeddings with RAG
- Vector Stores Module - Storing embeddings
- Models Module - Embedding model metadata
Next Steps: Learn about vector storage in the Vector Stores Module.