Semantic Cache¶

Added in: v0.17.7
File: src/selectools/cache_semantic.py

Overview¶

SemanticCache is a drop-in replacement for InMemoryCache that serves cached LLM responses for semantically equivalent queries — even when the exact wording differs.

Instead of exact-string key matching, it embeds each cache key using any EmbeddingProvider and compares incoming queries against the stored keys via cosine similarity. A hit is returned when the best match meets or exceeds the configurable similarity_threshold.

from selectools import AgentConfig
from selectools.cache_semantic import SemanticCache
from selectools.embeddings.openai import OpenAIEmbeddingProvider

cache = SemanticCache(
    embedding_provider=OpenAIEmbeddingProvider(),
    similarity_threshold=0.92,
    max_size=500,
    default_ttl=3600,
)
config = AgentConfig(cache=cache)

# "What's the weather in NYC?" hits cache for "Weather in New York City?"

Quick Start¶

from selectools import Agent, AgentConfig
from selectools.cache_semantic import SemanticCache
from selectools.embeddings.openai import OpenAIEmbeddingProvider

cache = SemanticCache(
    embedding_provider=OpenAIEmbeddingProvider(),
    similarity_threshold=0.92,
)

agent = Agent(
    tools=[...],
    config=AgentConfig(cache=cache),
)

r1 = agent.run("What's the weather in NYC?")
r2 = agent.run("Weather in New York City?")  # cache hit — no LLM call

cache_hit = any(s.type.value == "cache_hit" for s in r2.trace.steps)
print(cache_hit)  # True
print(cache.stats)  # CacheStats(hits=1, misses=1, evictions=0, hit_rate=50.00%)

Constructor Parameters¶

Parameter	Type	Default	Description
`embedding_provider`	`EmbeddingProvider`	—	Provides `embed_text()` and `embed_query()`. Required.
`similarity_threshold`	`float`	`0.92`	Minimum cosine similarity for a cache hit. Range: `[0.0, 1.0]`.
`max_size`	`int`	`1000`	Maximum entries before LRU eviction. `0` = unbounded.
`default_ttl`	`Optional[int]`	`None`	Default TTL in seconds. `None` = no expiry.

Methods¶

`get(key: str) → Optional[Tuple[Any, Any]]`¶

Embeds key and scans stored entries for the best cosine similarity match.

Returns the cached value if the best score is ≥ similarity_threshold and the matched entry has not expired.
Moves the matched entry to the end of the LRU list on hit.
Returns None on miss.
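
The lookup described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `cosine` and `best_match` are hypothetical helper names, entries are held in a plain list, and TTL checks and LRU reordering are left out.

```python
import math
from typing import Any, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    query_vec: List[float],
    entries: List[Tuple[List[float], Any]],  # (stored embedding, cached value)
    threshold: float,
) -> Optional[Any]:
    """Return the value of the most similar entry, or None if it scores below threshold."""
    best_score, best_value = -1.0, None
    for vec, value in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_value = score, value
    return best_value if best_score >= threshold else None


# Toy 2-dimensional "embeddings" chosen by hand for illustration
entries = [([1.0, 0.0], "weather answer"), ([0.0, 1.0], "stock answer")]
hit = best_match([0.9, 0.1], entries, threshold=0.92)   # close to the first entry
miss = best_match([0.7, 0.7], entries, threshold=0.92)  # equally far from both
print(hit, miss)
```

Because every stored entry is scored on each lookup, the cost of a scan like this grows linearly with the number of entries.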

`set(key: str, value: Tuple[Any, Any], ttl: Optional[int] = None) → None`¶

Embeds key and stores the entry.

Replaces an existing entry if the exact key already exists.
Evicts the LRU entry when max_size is reached.
A ttl argument, if provided, overrides default_ttl for this entry.

`delete(key: str) → bool`¶

Removes an entry by exact original key. Returns True if found and removed.

`clear() → None`¶

Removes all entries and resets CacheStats.

`stats → CacheStats`¶

Read-only snapshot of hit/miss/eviction counters and hit_rate.

`size → int`¶

Number of entries currently stored (includes expired entries not yet pruned).

How Similarity Works¶

Cosine similarity is computed in pure Python (no NumPy):

similarity(a, b) = dot(a, b) / (‖a‖ · ‖b‖)

Many providers return embeddings that are already normalised to unit length. For such vectors the denominator is 1, so the formula reduces to a plain dot product. No external dependencies are required beyond the embedding provider itself.
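
A minimal pure-Python version of the formula, shown only to make the arithmetic concrete. The function names are illustrative, and returning 0.0 for a zero-length vector is an assumption rather than documented behaviour.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    denom = norm(a) * norm(b)
    if denom == 0.0:
        return 0.0  # assumption: treat zero vectors as dissimilar to everything
    return dot(a, b) / denom


def normalise(a: Sequence[float]) -> List[float]:
    n = norm(a)
    return [x / n for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
full = cosine_similarity(a, b)              # 24 / (5 * 5)
shortcut = dot(normalise(a), normalise(b))  # same value, computed on unit vectors
print(full, shortcut)
```

Both expressions give 0.96 up to floating-point rounding, which is the sense in which unit-length embeddings reduce the similarity to a dot product.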

Threshold Guide¶

Threshold	Behaviour
`0.99–1.0`	Near-exact matches only (minor typo tolerance)
`0.92–0.98`	Paraphrases and synonyms (recommended for general use)
`0.80–0.92`	Loose topic-level similarity
`< 0.80`	Very permissive; may cause false hits
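
One way to choose a threshold is to score a handful of query pairs you have labelled by hand and count the errors at each candidate value. The sketch below assumes you already have such scores; the numbers are made up for illustration and `evaluate_threshold` is a hypothetical helper, not part of selectools.

```python
from typing import List, Tuple


def evaluate_threshold(
    scored_pairs: List[Tuple[float, bool]],  # (cosine similarity, truly equivalent?)
    threshold: float,
) -> Tuple[int, int]:
    """Return (false_hits, missed_hits) for the given threshold."""
    false_hits = sum(1 for score, same in scored_pairs if score >= threshold and not same)
    missed_hits = sum(1 for score, same in scored_pairs if score < threshold and same)
    return false_hits, missed_hits


# Illustrative scores only, not measurements from any real embedding provider
pairs = [(0.97, True), (0.94, True), (0.90, True), (0.88, False), (0.75, False)]

for t in (0.80, 0.92, 0.99):
    print(t, evaluate_threshold(pairs, t))
```

With these sample scores, 0.80 lets one false hit through, 0.92 misses one genuine paraphrase, and 0.99 misses all three.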

TTL and Expiry¶

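from selectools.cache_semantic import SemanticCache

# Assumes ep is an EmbeddingProvider instance and response is the value to cache.
cache = SemanticCache(
    embedding_provider=ep,
    default_ttl=3600,  # 1-hour default TTL
)

# Override TTL per entry
cache.set("weather nyc", response, ttl=300)  # 5-minute TTL for this entry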

Expired entries are not eagerly pruned. They are skipped during get() scans but stay in storage, and still count toward size, until LRU eviction removes them once max_size is reached or clear() is called.
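
The lazy-expiry behaviour can be illustrated with a small stand-alone store. This is a sketch under stated assumptions, not the SemanticCache source: `LazyTTLStore` is a hypothetical class, it uses exact-key lookup instead of a similarity scan to keep the focus on expiry, and the injectable `clock` exists only so the example does not have to sleep.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class LazyTTLStore:
    def __init__(
        self,
        default_ttl: Optional[int] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._default_ttl = default_ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
        # A per-entry ttl wins over the default; None means "never expires".
        effective = ttl if ttl is not None else self._default_ttl
        expires_at = None if effective is None else self._clock() + effective
        self._entries[key] = (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            return None  # expired: skipped, but deliberately left in storage
        return value

    @property
    def size(self) -> int:
        return len(self._entries)  # includes expired entries


now = [0.0]  # fake clock so the example runs instantly
store = LazyTTLStore(default_ttl=3600, clock=lambda: now[0])
store.set("weather nyc", "sunny", ttl=300)
store.set("capital of france", "Paris")

now[0] = 301.0  # just past the 5-minute TTL
print(store.get("weather nyc"), store.get("capital of france"), store.size)
```

After the clock advances, the short-lived entry is no longer returned, yet the size is still 2 because nothing has pruned it.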

LRU Eviction¶

When size == max_size, the least-recently-used entry (front of the list) is evicted on the next set(). Accessing an entry via get() moves it to the end (most-recently-used position).
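
The eviction order can be reproduced with `collections.OrderedDict`, which is a common way to build an LRU structure in Python. Whether SemanticCache uses this container internally is not stated here, so treat `TinyLRU` as a hypothetical illustration. It also assumes that overwriting an existing key refreshes its position.

```python
from collections import OrderedDict
from typing import Any, List, Optional


class TinyLRU:
    def __init__(self, max_size: int) -> None:
        self.max_size = max_size  # 0 means unbounded, as in the parameter table
        self.evictions = 0
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        if key in self._data:
            self._data[key] = value
            self._data.move_to_end(key)  # assumption: a rewrite counts as a use
            return
        if self.max_size and len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # drop the front (least recently used)
            self.evictions += 1
        self._data[key] = value

    def keys(self) -> List[str]:
        return list(self._data)


lru = TinyLRU(max_size=2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")     # "a" moves to the most-recently-used end
lru.set("c", 3)  # cache is full, so "b" is evicted
print(lru.keys(), lru.evictions)
```

Reading "a" before inserting "c" is what saves it: the entry evicted is "b", the one that has gone longest without being touched.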

Thread Safety¶

All public methods acquire an internal threading.Lock. SemanticCache is safe to share across threads.
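
The locking pattern described above looks roughly like the following. `LockedStore` is a hypothetical stand-in that only shows how a single `threading.Lock` guards both the data and the counters; it is not the SemanticCache code.

```python
import threading
from typing import Any, Dict, Optional


class LockedStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        with self._lock:  # counters are updated under the same lock as the lookup
            if key in self._data:
                self.hits += 1
                return self._data[key]
            self.misses += 1
            return None


store = LockedStore()
store.set("q", "answer")


def worker() -> None:
    for _ in range(1000):
        store.get("q")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store.hits, store.misses)
```

Because each increment happens while the lock is held, eight threads making 1000 lookups each always produce exactly 8000 hits.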

Trace Integration¶

When a cache hit occurs through the agent, a CACHE_HIT step is appended to AgentTrace:

# Assumes agent is configured as in Quick Start and StepType has been imported.
result = agent.run("What's the weather?")
for step in result.trace.steps:
    if step.type == StepType.CACHE_HIT:
        print(f"Cache hit: {step.summary}")

Comparison with InMemoryCache¶

Feature	`InMemoryCache`	`SemanticCache`
Key matching	Exact string	Cosine similarity
Extra dependency	None	EmbeddingProvider
Hit on paraphrase	No	Yes
LRU eviction	Yes	Yes
TTL support	Yes	Yes
Thread-safe	Yes	Yes

Example¶

See examples/52_semantic_cache.py for a runnable demo using a mock embedding provider.
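
For experimenting without network calls, a toy provider along these lines may be enough. This is not the mock used in the example file. `ToyEmbeddingProvider` is a hypothetical class, and it only assumes the two method names listed under Constructor Parameters, each taking a string and returning a list of floats. The real EmbeddingProvider interface may require more than this.

```python
import math
import re
from typing import List


class ToyEmbeddingProvider:
    """Bag-of-words vectors over a fixed vocabulary, normalised to unit length."""

    def __init__(self, vocabulary: List[str]) -> None:
        self._index = {word: i for i, word in enumerate(vocabulary)}

    def embed_text(self, text: str) -> List[float]:
        vec = [0.0] * len(self._index)
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in self._index:
                vec[self._index[word]] += 1.0
        length = math.sqrt(sum(x * x for x in vec))
        return [x / length for x in vec] if length else vec

    def embed_query(self, query: str) -> List[float]:
        return self.embed_text(query)  # same embedding for keys and queries


provider = ToyEmbeddingProvider(["weather", "nyc", "stock", "price"])
a = provider.embed_query("What's the weather in NYC?")
b = provider.embed_text("NYC weather")
c = provider.embed_text("Stock price today")

same_topic = sum(x * y for x, y in zip(a, b))
other_topic = sum(x * y for x, y in zip(a, c))
print(same_topic, other_topic)
```

The two weather queries map to the same unit vector, so their similarity is 1.0 up to rounding, while the stock query shares no vocabulary with them and scores 0.0.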