Advanced Chunking¶
Directory: src/selectools/rag/
Source: chunking.py
Table of Contents¶
- Overview
- Chunking Progression
- Quick Start
- SemanticChunker
- ContextualChunker
- Composability
- Full Pipeline
- Comparison Table
- Best Practices
- Performance Considerations
- Troubleshooting
- Further Reading
Overview¶
Advanced Chunking provides semantic and contextual text splitting strategies that go beyond fixed-size or recursive splitting. These chunkers produce higher-quality chunks for RAG systems by:
- SemanticChunker: Splitting at topic boundaries using embedding similarity (not character count)
- ContextualChunker: Enriching each chunk with LLM-generated context that situates it within the full document
Both chunkers integrate with the selectools RAG pipeline: load → chunk → embed → store → search.
Chunking Progression¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CHUNKING EVOLUTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Fixed-size (TextSplitter) │
│ └─ Split at every N characters │
│ └─ Fast, predictable, but may cut mid-sentence or mid-topic │
│ │
│ 2. Recursive (RecursiveTextSplitter) │
│ └─ Prefer natural boundaries: \n\n → \n → . → space → char │
│ └─ Better coherence, still character-driven │
│ │
│ 3. Semantic (SemanticChunker) │
│ └─ Split when embedding similarity between consecutive sentences drops │
│ └─ Groups sentences by topic; chunks follow meaning boundaries │
│ │
│ 4. Contextual (ContextualChunker) │
│ └─ Wraps ANY chunker; adds LLM-generated context per chunk │
│ └─ Each chunk prefixed with situating description (Anthropic technique) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Quick Start¶
SemanticChunker¶
from selectools.rag import SemanticChunker, Document
from selectools.embeddings import OpenAIEmbeddingProvider
embedder = OpenAIEmbeddingProvider()
chunker = SemanticChunker(embedder, similarity_threshold=0.75)
# Split text
text = "Topic A sentence 1. Topic A sentence 2. Topic B sentence 1. Topic B sentence 2."
chunks = chunker.split_text(text)
# Split documents
docs = [Document(text=text, metadata={"source": "doc.txt"})]
chunked_docs = chunker.split_documents(docs)
ContextualChunker¶
from selectools.rag import ContextualChunker, RecursiveTextSplitter, Document
from selectools import OpenAIProvider
base = RecursiveTextSplitter(chunk_size=1000, chunk_overlap=200)
provider = OpenAIProvider()
chunker = ContextualChunker(base_chunker=base, provider=provider)
docs = [Document(text="Full document about product features and pricing...")]
enriched_docs = chunker.split_documents(docs)
# Each chunk now has: [Context] <description>\n\n<original chunk>
for doc in enriched_docs:
print(doc.metadata.get("context"))
SemanticChunker¶
How It Works¶
SemanticChunker splits documents at topic boundaries by comparing consecutive sentence embeddings. Consecutive sentences with high cosine similarity stay together; when similarity drops below the threshold, a new chunk starts.
Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC CHUNKER FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Text Input │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Sentence Split │ _split_into_sentences(text) │
│ │ (regex heuristic)│ Handles abbreviations, decimals │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ [Sent1, Sent2, Sent3, ...] │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Embed │ embedder.embed_texts(sentences) │
│ │ (EmbeddingProvider) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ [[0.1, 0.2, ...], [0.3, 0.4, ...], ...] │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Similarity Check │ _cosine_similarity(emb[i-1], emb[i]) │
│ │ (pure Python) │ No numpy required │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ sim >= threshold? → Keep in group │
│ sim < threshold? → Split (if >= min_chunk_sentences) │
│ len(group) >= max? → Force split │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Group │ Join sentences per chunk │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ [Chunk1, Chunk2, Chunk3, ...] │
│ metadata: chunker="semantic", chunk, total_chunks │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
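The similarity-check and grouping steps in the flow above can be sketched in pure Python. This is an illustrative reimplementation of the documented behavior, not the library's actual code; `cosine_similarity` and `group_sentences` are hypothetical names standing in for the internal helpers.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Pure-Python cosine similarity, as described above (no numpy).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_sentences(sentences, embeddings, threshold=0.75,
                    min_sentences=1, max_sentences=50):
    # Start a new group when similarity to the previous sentence drops
    # below the threshold (or the current group hits max_sentences).
    if not sentences:
        return []
    groups, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if (sim < threshold and len(current) >= min_sentences) \
                or len(current) >= max_sentences:
            groups.append(" ".join(current))
            current = []
        current.append(sentences[i])
    groups.append(" ".join(current))
    return groups
```

With toy 2-D embeddings, two same-topic sentences stay together and the off-topic one starts a new chunk: `group_sentences(["A1.", "A2.", "B1."], [[1, 0], [1, 0], [0, 1]])` yields `["A1. A2.", "B1."]`.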
Constructor¶
SemanticChunker(
embedder: EmbeddingProvider,
similarity_threshold: float = 0.75,
min_chunk_sentences: int = 1,
max_chunk_sentences: int = 50,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `embedder` | `EmbeddingProvider` | — | Provider for sentence embeddings (OpenAI, Gemini, etc.) |
| `similarity_threshold` | `float` | `0.75` | Below this cosine similarity, start a new chunk (0.0–1.0) |
| `min_chunk_sentences` | `int` | `1` | Minimum sentences before a split is allowed |
| `max_chunk_sentences` | `int` | `50` | Maximum sentences per chunk (forces a split if exceeded) |
Methods¶
| Method | Returns | Description |
|---|---|---|
| `split_text(text: str)` | `List[str]` | Split raw text into semantic chunks |
| `split_documents(documents)` | `List[Document]` | Split documents, preserving original metadata and adding chunk metadata |
Metadata Added¶
{
"chunker": "semantic",
"chunk": 0, # Index of this chunk
"total_chunks": 3, # Total chunks from this document
# ... original metadata (source, etc.)
}
Supported Embedding Providers¶
Any provider implementing EmbeddingProvider:
from selectools.embeddings import (
OpenAIEmbeddingProvider,
GeminiEmbeddingProvider,
AnthropicEmbeddingProvider,
CohereEmbeddingProvider,
)
# All work with SemanticChunker
chunker = SemanticChunker(GeminiEmbeddingProvider(), similarity_threshold=0.75) # Free
chunker = SemanticChunker(OpenAIEmbeddingProvider(), similarity_threshold=0.80)
Threshold Tuning Guide¶
| Threshold | Effect | Use Case |
|---|---|---|
| 0.85+ | Fewer, larger chunks | Dense technical docs, single-topic content |
| 0.75 | Balanced (default) | General documents, mixed topics |
| 0.65 | More, smaller chunks | Multi-topic docs, news articles |
| 0.55 | Very fine-grained | Each sentence often its own chunk |
# Dense technical manual
chunker = SemanticChunker(embedder, similarity_threshold=0.85)
# News article with many unrelated sections
chunker = SemanticChunker(embedder, similarity_threshold=0.65)
Implementation Notes¶
- Pure Python cosine similarity: uses `math.sqrt` and the built-in `sum`; no numpy dependency.
- Sentence splitting: the regex `(?<=[.!?])\s+(?=[A-Z"\'(\[])` avoids splitting on abbreviations (e.g. "Dr.", "U.S.") or decimals.
- Helper functions (internal): `_split_into_sentences(text)`, `_cosine_similarity(a, b)`.
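As a rough sketch, the sentence-splitting heuristic described above can be reproduced like this (illustrative only; the library's internal `_split_into_sentences` may differ in detail):

```python
import re

# Split after ., ! or ? only when followed by whitespace and something
# that looks like a sentence opener (capital letter, quote, bracket).
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'(\[])')

def split_into_sentences(text: str) -> list[str]:
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
```

Decimals like `3.5` never match (no whitespace after the period), and `U.S. economy` survives because `economy` is lowercase.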
ContextualChunker¶
How It Works¶
ContextualChunker wraps any chunker and enriches each chunk with an LLM-generated description. The LLM sees the full document and the chunk, then produces a 1–2 sentence description that situates the chunk within the document. This description is prepended to the chunk text so that embeddings capture the chunk's role in context.
Inspired by Anthropic's Contextual Retrieval.
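The enrichment loop can be sketched as follows. This is an illustrative outline, not the library's implementation; `generate_context` is a hypothetical stand-in for whatever completion call the configured provider makes.

```python
DEFAULT_PROMPT = (
    "Document:\n<document>\n{document}\n</document>\n\n"
    "Chunk:\n<chunk>\n{chunk}\n</chunk>\n\n"
    "Give a short (1-2 sentence) description that situates this chunk "
    "within the overall document for search and retrieval purposes.\n"
    "Respond ONLY with the description, nothing else."
)

def enrich_chunks(full_document: str, chunks: list[str], generate_context,
                  max_document_chars: int = 50_000,
                  context_prefix: str = "[Context] ") -> list[str]:
    # Truncate the document to bound prompt size, then prepend an
    # LLM-generated situating description to each chunk.
    doc = full_document[:max_document_chars]
    enriched = []
    for chunk in chunks:
        prompt = DEFAULT_PROMPT.format(document=doc, chunk=chunk)
        context = generate_context(prompt).strip()
        enriched.append(f"{context_prefix}{context}\n\n{chunk}")
    return enriched
```

Note that the full (truncated) document is sent once per chunk, which is why cost scales with chunk count (see Performance Considerations below).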
Format¶
Each enriched chunk has the form `[Context] <description>\n\n<original chunk>`: the generated description, then a blank line, then the original chunk text.
Constructor¶
ContextualChunker(
base_chunker: Any, # Object with split_documents()
provider: Provider,
model: str = "gpt-4o-mini",
prompt_template: Optional[str] = None,
max_document_chars: int = 50_000,
context_prefix: str = "[Context] ",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_chunker` | `Any` | — | Chunker with a `split_documents()` method |
| `provider` | `Provider` | — | LLM provider for context generation |
| `model` | `str` | `"gpt-4o-mini"` | Model for context generation |
| `prompt_template` | `str` or `None` | `None` | Custom prompt with `{document}` and `{chunk}` placeholders |
| `max_document_chars` | `int` | `50_000` | Truncate the document to this length before prompting |
| `context_prefix` | `str` | `"[Context] "` | Prefix before the generated description |
Methods¶
| Method | Returns | Description |
|---|---|---|
| `split_documents(documents)` | `List[Document]` | Split and enrich each chunk with context |
| `split_text(text: str)` | `List[str]` | Convenience: wraps the text in a `Document`, splits, returns chunk texts |
Metadata Added¶
{
"chunker": "contextual",
"context": "LLM-generated description string",
# ... base chunker metadata (chunk, total_chunks, etc.)
}
Default Prompt Template¶
Document:
<document>
{document}
</document>
Chunk:
<chunk>
{chunk}
</chunk>
Give a short (1-2 sentence) description that situates this chunk
within the overall document for search and retrieval purposes.
Respond ONLY with the description, nothing else.
Custom Prompt Examples¶
Technical documentation:
custom_prompt = (
"Document:\n<document>\n{document}\n</document>\n\n"
"Chunk:\n<chunk>\n{chunk}\n</chunk>\n\n"
"Describe this chunk's role in the API documentation in 1-2 sentences. "
"Include the section or feature it covers. Response only."
)
chunker = ContextualChunker(base, provider, prompt_template=custom_prompt)
Legal / contracts:
legal_prompt = (
"Document:\n<document>\n{document}\n</document>\n\n"
"Chunk:\n<chunk>\n{chunk}\n</chunk>\n\n"
"Summarize what legal clause or provision this chunk covers in 1-2 sentences. "
"Be precise. Respond only with the description."
)
chunker = ContextualChunker(base, provider, prompt_template=legal_prompt)
Non-English:
spanish_prompt = (
"Documento:\n<document>\n{document}\n</document>\n\n"
"Fragmento:\n<chunk>\n{chunk}\n</chunk>\n\n"
"Da una descripción breve (1-2 oraciones) que sitúe este fragmento "
"en el documento completo para búsqueda. Responde SOLO con la descripción."
)
chunker = ContextualChunker(base, provider, prompt_template=spanish_prompt)
Compatible Base Chunkers¶
- `TextSplitter`
- `RecursiveTextSplitter`
- `SemanticChunker`
- Any object with `split_documents(documents) -> List[Document]`
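Because the base-chunker contract is plain duck typing, a custom chunker is easy to write. A minimal illustrative example (with a stand-in `Document` dataclass so the sketch is self-contained; in practice you would import `Document` from `selectools.rag`):

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # stand-in for selectools.rag.Document
    text: str
    metadata: dict = field(default_factory=dict)

class ParagraphChunker:
    """Custom base chunker: split each document on blank lines."""
    def split_documents(self, documents):
        out = []
        for doc in documents:
            paras = [p for p in doc.text.split("\n\n") if p.strip()]
            for i, para in enumerate(paras):
                out.append(Document(
                    text=para,
                    metadata={**doc.metadata, "chunk": i,
                              "total_chunks": len(paras)},
                ))
        return out
```

Any such object can then be passed as `base_chunker` to `ContextualChunker`.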
Composability¶
ContextualChunker wraps any chunker. A common pattern is SemanticChunker as the base for ContextualChunker:
from selectools.rag import SemanticChunker, ContextualChunker, Document
from selectools.embeddings import OpenAIEmbeddingProvider
from selectools import OpenAIProvider
# Semantic splitting for topic boundaries
embedder = OpenAIEmbeddingProvider()
semantic = SemanticChunker(embedder, similarity_threshold=0.75)
# Add LLM-generated context to each chunk
provider = OpenAIProvider()
chunker = ContextualChunker(base_chunker=semantic, provider=provider)
docs = [Document(text="Multi-topic document...", metadata={"source": "doc.txt"})]
enriched = chunker.split_documents(docs)
# Result: semantic boundaries + contextual enrichment
┌────────────────────────────────────────────────────────────┐
│ COMPOSABILITY │
├────────────────────────────────────────────────────────────┤
│ │
│ Document │
│ │ │
│ ▼ │
│ SemanticChunker.split_documents() │
│ │ │
│ ▼ │
│ [Chunk1, Chunk2, Chunk3] (topic boundaries) │
│ │ │
│ ▼ │
│ ContextualChunker (for each chunk) │
│ │ │
│ ├─ LLM: "Chunk1 in context of full doc" │
│ ├─ LLM: "Chunk2 in context of full doc" │
│ └─ LLM: "Chunk3 in context of full doc" │
│ │ │
│ ▼ │
│ [Context] <desc1>\n\nChunk1 │
│ [Context] <desc2>\n\nChunk2 │
│ [Context] <desc3>\n\nChunk3 │
│ │
└────────────────────────────────────────────────────────────┘
Full Pipeline¶
End-to-end flow: Semantic + Contextual → VectorStore → Search
from selectools.rag import (
DocumentLoader,
SemanticChunker,
ContextualChunker,
RecursiveTextSplitter,
VectorStore,
RAGTool,
)
from selectools.embeddings import OpenAIEmbeddingProvider
from selectools import OpenAIProvider, Agent
# 1. Load
docs = DocumentLoader.from_directory("./docs", glob_pattern="**/*.md")
# 2. Chunk (Semantic + Contextual)
embedder = OpenAIEmbeddingProvider()
semantic = SemanticChunker(embedder, similarity_threshold=0.75)
provider = OpenAIProvider()
chunker = ContextualChunker(base_chunker=semantic, provider=provider)
chunked = chunker.split_documents(docs)
# 3. Store
store = VectorStore.create("sqlite", embedder=embedder, db_path="kb.db")
store.add_documents(chunked)
# 4. Search
rag_tool = RAGTool(vector_store=store, top_k=3)
agent = Agent(tools=[rag_tool.search_knowledge_base], provider=provider)
response = agent.run("What are the installation steps?")
Comparison Table¶
| Feature | TextSplitter | RecursiveTextSplitter | SemanticChunker | ContextualChunker |
|---|---|---|---|---|
| Split criterion | Character count | Character + separators | Embedding similarity | Any (wraps base) |
| Topic boundaries | ❌ | Partial | ✅ | ✅ (via context) |
| External API | None | None | Embeddings | LLM completion |
| Cost | Free | Free | Embedding cost | Embedding + LLM |
| Speed | Fast | Fast | Slower (embeds) | Slowest (LLM/embed) |
| Chunk quality | Basic | Good | High | Highest |
| Dependencies | None | None | EmbeddingProvider | Provider + base |
| Use case | Simple RAG | General RAG | Topic-aware RAG | High-precision RAG |
Best Practices¶
1. Use SemanticChunker for Multi-Topic Documents¶
# ✅ Good - topic boundaries
chunker = SemanticChunker(embedder, similarity_threshold=0.75)
# ❌ Less ideal - may cut mid-topic
splitter = RecursiveTextSplitter(chunk_size=1000)
2. Use Gemini Embeddings for Cost Savings¶
from selectools.embeddings import GeminiEmbeddingProvider
# SemanticChunker with free embeddings
embedder = GeminiEmbeddingProvider()
chunker = SemanticChunker(embedder)
3. Reserve ContextualChunker for High-Value Content¶
# Use for critical knowledge bases (legal, medical, support)
# Skip for large, low-stakes corpora
chunker = ContextualChunker(semantic, provider)
4. Tune similarity_threshold per Domain¶
# Dense technical docs
chunker = SemanticChunker(embedder, similarity_threshold=0.85)
# Mixed articles
chunker = SemanticChunker(embedder, similarity_threshold=0.70)
5. Limit Context Generation Cost with max_document_chars¶
# Truncate very long documents to control prompt size
chunker = ContextualChunker(
base,
provider,
max_document_chars=30_000 # ~7.5k tokens
)
Performance Considerations¶
SemanticChunker¶
- Embedding API calls: One batch per document (all sentences).
- Batch size: Typically 10–200 sentences per document.
- Cost: Embedding cost only (e.g. OpenAI ~$0.02/1M tokens).
ContextualChunker¶
- LLM API calls: One completion per chunk.
- Cost: Embedding (for chunks) + LLM (for context generation).
- Example: 50 chunks × gpt-4o-mini ≈ 50 completion calls; cost scales with chunk count.
# Estimate: 50 chunks, ~200 tokens per completion
# gpt-4o-mini: ~$0.15/1M input, $0.60/1M output
# 50 × 200 ≈ 10k output tokens ≈ $0.006 per document
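The back-of-envelope arithmetic above can be wrapped in a small helper. The per-million-token prices are the illustrative rates quoted above; check current pricing before relying on them.

```python
def estimate_context_cost(num_chunks: int,
                          input_tokens_per_call: int = 2_000,
                          output_tokens_per_call: int = 200,
                          input_price_per_m: float = 0.15,
                          output_price_per_m: float = 0.60) -> float:
    # Dollar cost of one LLM completion per chunk at the given rates.
    input_cost = num_chunks * input_tokens_per_call / 1e6 * input_price_per_m
    output_cost = num_chunks * output_tokens_per_call / 1e6 * output_price_per_m
    return input_cost + output_cost
```

Counting only output tokens, `estimate_context_cost(50, input_tokens_per_call=0)` reproduces the ~$0.006 figure above; in practice input tokens (the truncated document, resent per chunk) usually dominate.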
Recommendations¶
| Scenario | Suggested approach |
|---|---|
| Large corpus, budget | RecursiveTextSplitter or SemanticChunker |
| Small, high-value docs | ContextualChunker (Semantic base) |
| Prototyping | RecursiveTextSplitter |
| Production, mixed usage | SemanticChunker + optional ContextualChunker |
Troubleshooting¶
SemanticChunker: Chunks Too Large¶
# Issue: Single-topic docs produce very large chunks
# Fix: Lower similarity_threshold or reduce max_chunk_sentences
chunker = SemanticChunker(
embedder,
similarity_threshold=0.65,
max_chunk_sentences=25
)
SemanticChunker: Chunks Too Small¶
# Issue: Every sentence becomes its own chunk
# Fix: Raise similarity_threshold
chunker = SemanticChunker(embedder, similarity_threshold=0.85)
ContextualChunker: Slow or Expensive¶
# Issue: Too many chunks → many LLM calls
# Fix: Use a chunker that produces fewer, larger chunks
base = RecursiveTextSplitter(chunk_size=1500, chunk_overlap=300)
chunker = ContextualChunker(base_chunker=base, provider=provider)
ContextualChunker: Truncated Document Context¶
# Issue: Document truncated before relevant section
# Fix: Increase max_document_chars (watch token limits)
chunker = ContextualChunker(
base,
provider,
max_document_chars=80_000 # ~20k tokens
)
Embedding Rate Limits¶
# Issue: SemanticChunker hits embedding API rate limits on large docs
# Fix: Process documents in smaller batches, or use Gemini (higher limits)
embedder = GeminiEmbeddingProvider()
chunker = SemanticChunker(embedder)
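One simple way to batch, sketched here with a pause between batches (illustrative; `chunker` is any object exposing `split_documents()`, and the batch size and pause are assumptions to tune against your provider's limits):

```python
import time

def chunk_in_batches(chunker, documents, batch_size=10, pause_seconds=1.0):
    # Process documents a few at a time, sleeping between batches to
    # stay under the embedding provider's rate limit.
    chunked = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        chunked.extend(chunker.split_documents(batch))
        if start + batch_size < len(documents):
            time.sleep(pause_seconds)
    return chunked
```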
Further Reading¶
- RAG Module - Full RAG pipeline and RAGAgent
- Embeddings Module - Embedding providers
- Vector Stores Module - Storage and search
Next Steps: Integrate advanced chunking into your pipeline via RAG Module.