Hybrid Search Module¶
Directory: src/selectools/rag/
Files: bm25.py, hybrid.py, tools.py, reranker.py
Table of Contents¶
- Overview
- Architecture
- Quick Start
- BM25 Keyword Search
- HybridSearcher
- HybridSearchTool
- Reranking
- Configuration Guide
- RAGTool vs HybridSearchTool
- Best Practices
- Troubleshooting
- Further Reading
Overview¶
Hybrid search combines two complementary retrieval strategies to improve recall and precision:
| Strategy | Captures | Strengths |
|---|---|---|
| Vector (Semantic) | Meaning, context, synonyms | Finds conceptually similar content even with different wording |
| BM25 (Keyword) | Exact terms, names, acronyms | Matches specific words and phrases precisely |
Why Hybrid Search Matters¶
Semantic search alone can miss important matches:
Query: "GDPR compliance requirements"
Document: "European Union General Data Protection Regulation mandates..."
- Vector search: May find it (similar meaning) ✓
- Keyword search: "GDPR" exact match → high relevance ✓
Keyword search alone fails on paraphrasing:
- Keyword search: No overlap ✗
- Vector search: High semantic similarity ✓
Hybrid search runs both and fuses results, catching cases that either approach might miss.
Import Paths¶
from selectools.rag import (
BM25,
HybridSearcher,
FusionMethod,
HybridSearchTool,
Reranker,
CohereReranker,
JinaReranker,
)
Architecture¶
Flow Diagram¶
┌─────────────────────────────────────────┐
│ USER QUERY │
│ "GDPR compliance requirements" │
└──────────────────┬───────────────────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ │ │
▼ │ ▼
┌───────────────────────┐ │ ┌───────────────────────┐
│ Vector Store │ │ │ BM25 Index │
│ (Semantic Search) │ │ │ (Keyword Search) │
│ • embed_query() │ │ │ • tokenize() │
│ • cosine similarity │ │ │ • Okapi BM25 score │
└───────────┬───────────┘ │ └───────────┬───────────┘
│ │ │
│ vector_top_k │ keyword_top_k │
▼ │ ▼
┌───────────────────────┐ │ ┌───────────────────────┐
│ Vector Results │ │ │ Keyword Results │
│ [SearchResult, ...] │ │ │ [SearchResult, ...] │
└───────────┬───────────┘ │ └───────────┬───────────┘
│ │ │
└────────────────────────────┼────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ FUSION │
│ RRF or Weighted Linear Combination │
│ • Deduplicate documents │
│ • Combine scores │
└──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ RERANKER (Optional) │
│ Cohere / Jina cross-encoder │
│ Re-score fused candidates │
└──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ FINAL RESULTS │
│ [SearchResult, ...] sorted by score │
└─────────────────────────────────────────┘
Quick Start¶
Minimal Working Example¶
from selectools.rag import Document, VectorStore, HybridSearcher, HybridSearchTool
from selectools.embeddings import OpenAIEmbeddingProvider
from selectools import Agent, OpenAIProvider
# 1. Set up vector store with embeddings
embedder = OpenAIEmbeddingProvider()
store = VectorStore.create("memory", embedder=embedder)
# 2. Create hybrid searcher
searcher = HybridSearcher(vector_store=store)
# 3. Add documents (indexes in both vector store and BM25)
docs = [
Document(text="GDPR requires data protection by design.", metadata={"source": "legal.md"}),
Document(text="Python is a programming language.", metadata={"source": "intro.md"}),
Document(text="European Union data regulation compliance guide.", metadata={"source": "eu.md"}),
]
searcher.add_documents(docs)
# 4. Search
results = searcher.search("GDPR compliance", top_k=3)
for r in results:
print(f"Score: {r.score:.4f} | {r.document.text[:50]}...")
# 5. Use with agent (optional)
hybrid_tool = HybridSearchTool(searcher=searcher, top_k=5)
agent = Agent(tools=[hybrid_tool.search_knowledge_base], provider=OpenAIProvider())
response = agent.run("What does GDPR require?")
BM25 Keyword Search¶
Class: BM25¶
Pure-Python Okapi BM25 keyword search with zero external dependencies. Uses only the Python standard library.
Constructor¶
| Parameter | Default | Description |
|---|---|---|
k1 |
1.5 | Term frequency saturation. Higher = more weight on repeated terms |
b |
0.75 | Length normalisation. 0 = none, 1 = full normalisation |
remove_stopwords |
True | Filter English stop words (a, the, is, etc.) |
Tokenization¶
- Regex-based: Splits on
[^a-z0-9]+(non-alphanumeric) - Lowercase: All text normalised to lowercase
- Stop words: Optional removal of common English words
from selectools.rag import BM25, Document
bm25 = BM25(remove_stopwords=True)
tokens = bm25.tokenize("The quick brown fox jumps over the lazy dog")
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
bm25_no_stop = BM25(remove_stopwords=False)
tokens = bm25_no_stop.tokenize("The quick brown fox")
# ['the', 'quick', 'brown', 'fox']
Indexing¶
from selectools.rag import BM25, Document
bm25 = BM25(k1=1.5, b=0.75)
# Build or rebuild index (replaces existing)
docs = [
Document(text="Python programming language"),
Document(text="Java programming language"),
Document(text="Machine learning with Python"),
]
bm25.index_documents(docs)
# Incrementally add documents
bm25.add_documents([Document(text="Rust systems programming")])
print(bm25.document_count) # 4
Search¶
# Returns List[SearchResult] (same format as VectorStore.search)
results = bm25.search("Python programming", top_k=2)
for result in results:
print(f"Score: {result.score:.4f}")
print(f"Text: {result.document.text}")
print(f"Metadata: {result.document.metadata}")
# With metadata filter
results = bm25.search(
"programming",
top_k=5,
filter={"source": "intro.md"},
)
Standalone BM25 Example¶
from selectools.rag import BM25, Document
bm25 = BM25()
docs = [
Document(text="Selectools is an AI agent framework"),
Document(text="Framework for building agents with tools"),
Document(text="AI and ML tool integration"),
]
bm25.index_documents(docs)
results = bm25.search("AI framework", top_k=2)
assert "Selectools" in results[0].document.text or "Framework" in results[0].document.text
HybridSearcher¶
Combines vector (semantic) and BM25 (keyword) retrieval with score fusion.
Constructor¶
HybridSearcher(
vector_store, # VectorStore instance (required)
bm25=None, # Pre-built BM25 or None (creates internal BM25)
vector_weight=0.5, # Weight for semantic results
keyword_weight=0.5, # Weight for keyword results
fusion="rrf", # "rrf" or "weighted"
rrf_k=60, # RRF constant (fusion="rrf" only)
reranker=None, # Optional Reranker for post-fusion re-scoring
)
Fusion Methods¶
from selectools.rag import FusionMethod
# RRF - Reciprocal Rank Fusion (default)
# Rank-based, no score normalisation needed, robust
searcher = HybridSearcher(vector_store=store, fusion=FusionMethod.RRF)
# or: fusion="rrf"
# WEIGHTED - Weighted Linear Combination
# Min-max normalised scores, then weighted sum
searcher = HybridSearcher(vector_store=store, fusion=FusionMethod.WEIGHTED)
# or: fusion="weighted"
| Fusion | Formula | Use When |
|---|---|---|
| RRF | score = w_v/(k+rank_v) + w_k/(k+rank_k) |
Default; handles diverse score scales well |
| WEIGHTED | score = w_v * norm(v) + w_k * norm(k) |
You want explicit score contribution control |
Methods¶
add_documents¶
Add documents to both vector store and BM25 index:
docs = DocumentLoader.from_directory("./docs")
ids = searcher.add_documents(docs)
# Optional: pre-computed embeddings
embeddings = embedder.embed_texts([d.text for d in docs])
ids = searcher.add_documents(docs, embeddings=embeddings)
index_existing_documents¶
Build BM25 index from documents already in the vector store (e.g. pre-populated store):
# Vector store was filled before HybridSearcher was created
store.add_documents(existing_docs)
searcher = HybridSearcher(vector_store=store)
searcher.index_existing_documents(existing_docs)
search¶
results = searcher.search(
query="GDPR data protection",
top_k=5,
filter={"category": "legal"},
vector_top_k=10, # Candidates from vector search (default: top_k * 2)
keyword_top_k=10, # Candidates from BM25 (default: top_k * 2)
)
Deduplication¶
Documents appearing in both vector and keyword result sets are automatically deduplicated before fusion. Matching is by document.text equality.
Complete Example¶
from selectools.rag import Document, VectorStore, HybridSearcher, FusionMethod
from selectools.embeddings import OpenAIEmbeddingProvider
embedder = OpenAIEmbeddingProvider()
store = VectorStore.create("memory", embedder=embedder)
searcher = HybridSearcher(
vector_store=store,
vector_weight=0.6,
keyword_weight=0.4,
fusion=FusionMethod.RRF,
rrf_k=60,
)
docs = [
Document(text="GDPR Article 32: Security of processing", metadata={"source": "gdpr.pdf"}),
Document(text="Data protection by design and default", metadata={"source": "gdpr.pdf"}),
]
searcher.add_documents(docs)
results = searcher.search("Article 32 security measures", top_k=3)
HybridSearchTool¶
Pre-built @tool-decorated search for agent integration. Drop-in replacement for RAGTool with better recall for exact terms, names, and acronyms.
Constructor¶
HybridSearchTool(
searcher, # HybridSearcher instance (required)
top_k=5,
score_threshold=0.0,
include_scores=True,
)
Tool Method: search_knowledge_base¶
The tool the agent calls:
from selectools.rag import HybridSearchTool
from selectools import Agent, OpenAIProvider
hybrid_tool = HybridSearchTool(searcher=searcher, top_k=5)
agent = Agent(
tools=[hybrid_tool.search_knowledge_base],
provider=OpenAIProvider(),
)
response = agent.run("What are the GDPR security requirements?")
Programmatic Method: search¶
Direct search without going through the agent:
results = hybrid_tool.search(
query="installation steps",
filter={"source": "README.md"},
)
# Returns List[SearchResult]
Output Format¶
Same as RAGTool:
[Source 1: gdpr.pdf (page 5), Relevance: 0.8234]
GDPR Article 32 requires appropriate technical and organizational
measures to ensure a level of security...
[Source 2: compliance.md, Relevance: 0.7102]
Security of processing includes encryption and pseudonymization...
Reranking¶
Rerankers use cross-encoder models to re-score candidates from initial retrieval, improving precision over bi-encoder similarity alone.
Reranker ABC¶
from abc import ABC, abstractmethod
class Reranker(ABC):
@abstractmethod
def rerank(
self,
query: str,
results: List[SearchResult],
top_k: Optional[int] = None,
) -> List[SearchResult]:
"""Re-score and re-order results. top_k=None returns all, re-ordered."""
pass
CohereReranker¶
Uses Cohere Rerank API v2:
from selectools.rag import CohereReranker
# pip install cohere # or pip install selectools[rag]
reranker = CohereReranker(
model="rerank-v3.5",
api_key=None, # Defaults to COHERE_API_KEY env var
)
# Standalone
reranked = reranker.rerank("best programming language", candidates, top_k=3)
# With HybridSearcher
searcher = HybridSearcher(
vector_store=store,
reranker=CohereReranker(model="rerank-v3.5"),
)
results = searcher.search("Python vs Java", top_k=5) # Reranked after fusion
JinaReranker¶
Uses Jina AI Rerank API via HTTP:
from selectools.rag import JinaReranker
# pip install requests
# JINA_API_KEY env var or api_key="..."
reranker = JinaReranker(
model="jina-reranker-v2-base-multilingual",
api_key=None,
api_url="https://api.jina.ai/v1/rerank",
)
# With HybridSearcher
searcher = HybridSearcher(
vector_store=store,
reranker=JinaReranker(),
)
results = searcher.search("multilingual document search", top_k=5)
Integration Flow¶
Reranking is applied after fusion. The reranker receives the fused candidate list and returns the final top_k by relevance score.
Configuration Guide¶
Tuning Weights¶
# Semantic-heavy (conceptual queries, paraphrasing)
searcher = HybridSearcher(
vector_store=store,
vector_weight=0.7,
keyword_weight=0.3,
)
# Keyword-heavy (exact terms, IDs, acronyms)
searcher = HybridSearcher(
vector_store=store,
vector_weight=0.3,
keyword_weight=0.7,
)
# Balanced (default)
searcher = HybridSearcher(
vector_store=store,
vector_weight=0.5,
keyword_weight=0.5,
)
Fusion Method Selection¶
| Scenario | Recommended |
|---|---|
| Default, diverse score scales | RRF |
| Need interpretable score contributions | WEIGHTED |
| Many candidate sources | RRF |
Candidate Sizes¶
# Retrieve more candidates for better fusion (default: top_k * 2)
results = searcher.search(
"query",
top_k=5,
vector_top_k=20,
keyword_top_k=20,
)
Reranker Selection¶
| Provider | Model | Best For |
|---|---|---|
| Cohere | rerank-v3.5 | High precision, English-heavy |
| Jina | jina-reranker-v2-base-multilingual | Multilingual content |
RAGTool vs HybridSearchTool¶
| Aspect | RAGTool | HybridSearchTool |
|---|---|---|
| Retrieval | Vector only | Vector + BM25 |
| Constructor | RAGTool(vector_store=...) |
HybridSearchTool(searcher=...) |
| Exact terms | May miss | Better recall |
| Names, acronyms | May miss | Better recall |
| Paraphrasing | Strong | Strong |
| Setup | Simpler | Requires HybridSearcher |
| Dependencies | Embedding provider | Embedding provider + BM25 (no extra deps) |
When to Use RAGTool¶
- Simple RAG, fast setup
- Purely conceptual/semantic queries
- Minimal infrastructure
When to Use HybridSearchTool¶
- Queries with exact terms, names, IDs, acronyms (e.g. "GDPR Article 32", "API v2")
- Technical documentation with code identifiers
- When recall matters more than simplicity
Migration Example¶
# Before (RAGTool)
rag_tool = RAGTool(vector_store=store, top_k=5)
agent = Agent(tools=[rag_tool.search_knowledge_base], provider=provider)
# After (HybridSearchTool) - drop-in replacement
searcher = HybridSearcher(vector_store=store)
searcher.add_documents(docs) # or index_existing_documents if store already full
hybrid_tool = HybridSearchTool(searcher=searcher, top_k=5)
agent = Agent(tools=[hybrid_tool.search_knowledge_base], provider=provider)
Best Practices¶
1. Use Hybrid Search for Mixed Queries¶
# Queries with both concepts and exact terms benefit most
searcher.search("OpenAI GPT-4 API rate limits") # "GPT-4" exact, "rate limits" semantic
2. Index Once, Search Many¶
# Add documents once
searcher.add_documents(docs)
# Search repeatedly
for query in user_queries:
results = searcher.search(query, top_k=5)
3. Add Reranking for Quality-Critical Use Cases¶
searcher = HybridSearcher(
vector_store=store,
reranker=CohereReranker(),
)
# Better precision at cost of latency and API usage
4. Tune BM25 for Your Corpus¶
# Longer documents: increase b for length normalisation
bm25 = BM25(k1=1.5, b=0.8)
# Shorter chunks: reduce b
bm25 = BM25(k1=1.5, b=0.5)
5. Use Metadata Filters Consistently¶
# Both vector and BM25 receive the same filter
results = searcher.search(
"query",
filter={"category": "api", "version": "v2"},
)
Troubleshooting¶
No Results from BM25¶
# Issue: All terms filtered as stop words
bm25 = BM25(remove_stopwords=True)
results = bm25.search("a the is", top_k=5) # Empty
# Fix: Disable stop word removal for query-heavy content
bm25 = BM25(remove_stopwords=False)
Vector Store Has No Embedder¶
# Error: "Vector store does not have an embedding provider configured."
# Fix: Pass embedder when creating store
store = VectorStore.create("memory", embedder=embedder)
searcher = HybridSearcher(vector_store=store)
Reranker API Errors¶
# Cohere: Set COHERE_API_KEY
# Jina: Set JINA_API_KEY or pass api_key to constructor
import os
os.environ["COHERE_API_KEY"] = "your-key"
# or
reranker = JinaReranker(api_key="your-jina-key")
Duplicate Documents in Results¶
HybridSearcher deduplicates by document.text. Ensure documents with identical text are intended (e.g. same chunk from different sources). If you see duplicates, check that Document instances are not being recreated with different object identities but same text.
Slow Search¶
# Issue: Large vector_top_k and keyword_top_k
# Fix: Reduce candidate sizes
results = searcher.search(
query,
top_k=5,
vector_top_k=10,
keyword_top_k=10,
)
Further Reading¶
- RAG Module - Complete RAG pipeline, document loading, chunking
- Embeddings Module - Embedding providers for vector search
- Vector Stores Module - VectorStore implementations
- Tools Module - Tool decorator and agent integration
Next Steps: Integrate hybrid search into your RAG pipeline by following the RAG Module and swapping RAGTool for HybridSearchTool when exact term recall matters.