Semantic Cache (Python)

betterdb-semantic-cache is the Python counterpart to @betterdb/semantic-cache. Same architecture, same Valkey data format, same Monitor integration — different language. A TypeScript app and a Python app can share the same cache index.

v0.4.0 reaches full feature parity with the TypeScript package: LLM-as-judge, reranking, embedding caching, cost tracking, config refresh, discovery, multi-modal prompts, batch lookups, and all framework adapters.

Prerequisites

Valkey 8.0+ with the valkey-search module loaded
Or Amazon ElastiCache for Valkey (8.0+)
Or Google Cloud Memorystore for Valkey
Python >= 3.11

Installation

pip install betterdb-semantic-cache
# With OpenAI embeddings (quoted so shells such as zsh do not expand the brackets):
pip install "betterdb-semantic-cache[openai]"
# All extras:
pip install "betterdb-semantic-cache[all]"

Quick start

import asyncio
import valkey.asyncio as valkey
from betterdb_semantic_cache import SemanticCache, SemanticCacheOptions
from betterdb_semantic_cache.embed.openai import create_openai_embed

async def main():
    client = valkey.Valkey(host="localhost", port=6399)
    cache = SemanticCache(SemanticCacheOptions(
        client=client,
        embed_fn=create_openai_embed(),  # text-embedding-3-small by default
        default_threshold=0.12,
    ))
    await cache.initialize()

    await cache.store("What is the capital of France?", "Paris")

    result = await cache.check("Capital city of France?")
    # result.hit == True
    # result.response == "Paris"
    # result.cost_saved is reported when the entry was stored with a model and token counts (see Cost tracking)

asyncio.run(main())

Configuration reference

Option	Type	Default	Description
`name`	`str`	`'betterdb_scache'`	Index name prefix for all Valkey keys
`client`	`valkey.asyncio.Valkey`	required	A valkey-py async client instance
`embed_fn`	`Callable[[str], Awaitable[list[float]]]`	required	Async embedding function
`default_threshold`	`float`	`0.1`	Cosine distance threshold (0–2); a lookup hits when distance <= threshold, so lower values are stricter
`default_ttl`	`int | None`	`None`	Default TTL in seconds
`category_thresholds`	`dict[str, float]`	`{}`	Per-category threshold overrides
`uncertainty_band`	`float`	`0.05`	Width of the uncertainty band below threshold
`cost_table`	`dict[str, ModelCost]`	`{}`	Custom model pricing overrides
`use_default_cost_table`	`bool`	`True`	Merge bundled LiteLLM price table
`embedding_cache.enabled`	`bool`	`True`	Cache computed embeddings in Valkey
`embedding_cache.ttl`	`int`	`86400`	Embedding cache TTL in seconds
`telemetry.tracer_name`	`str`	`'betterdb-semantic-cache'`	OTel tracer name
`telemetry.metrics_prefix`	`str`	`'semantic_cache'`	Prometheus metric name prefix
`config_refresh.enabled`	`bool`	`True`	Periodically re-read threshold config from Valkey
`config_refresh.interval_ms`	`int`	`30000`	Refresh interval in milliseconds (min: 1000)
`discovery.enabled`	`bool`	`True`	Register cache in Valkey for BetterDB Monitor
`discovery.heartbeat_interval_ms`	`int`	`30000`	Heartbeat interval in milliseconds
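
The 0–2 range of `default_threshold` is easier to reason about with the distance written out. The sketch below computes cosine distance as 1 minus cosine similarity in plain Python. The helper name `cosine_distance` is illustrative and not part of the package, and it assumes the distance Valkey reports follows this standard definition.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity, so results fall in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite directions

# With the default threshold, only near-identical directions count as hits.
threshold = 0.1
is_hit = same <= threshold
```

Vector length does not matter, only direction, which is why the first pair scores as identical even though one vector is twice as long.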

Adapters

All adapters are submodule imports with optional peer dependencies.

LangChain

from betterdb_semantic_cache.adapters.langchain import BetterDBSemanticCache
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(cache=BetterDBSemanticCache(cache=cache))

OpenAI Chat Completions

from betterdb_semantic_cache.adapters.openai import prepare_semantic_params

text, model = prepare_semantic_params(params)
result = await cache.check(text)

OpenAI Responses API

from betterdb_semantic_cache.adapters.openai_responses import prepare_semantic_params

text = prepare_semantic_params(params)

Anthropic Messages

from betterdb_semantic_cache.adapters.anthropic import prepare_semantic_params

text = prepare_semantic_params(params)

LlamaIndex

from betterdb_semantic_cache.adapters.llamaindex import prepare_semantic_params

text = prepare_semantic_params(messages, model="gpt-4o")

LangGraph (semantic memory store)

from betterdb_semantic_cache.adapters.langgraph import BetterDBSemanticStore

store = BetterDBSemanticStore(cache=cache)
await store.put(("user", "alice", "memories"), "mem1", {"content": "Alice lives in Paris."})
results = await store.search(("user", "alice", "memories"), query="Where does Alice live?")

Embedding helpers

Pre-built EmbedFn factories for common providers:

from betterdb_semantic_cache.embed.openai import create_openai_embed
from betterdb_semantic_cache.embed.bedrock import create_bedrock_embed
from betterdb_semantic_cache.embed.voyage import create_voyage_embed
from betterdb_semantic_cache.embed.cohere import create_cohere_embed
from betterdb_semantic_cache.embed.ollama import create_ollama_embed

Helper	Model default	Dimensions
`create_openai_embed`	`text-embedding-3-small`	1536
`create_bedrock_embed`	`amazon.titan-embed-text-v2:0`	1024
`create_voyage_embed`	`voyage-3-lite`	512
`create_cohere_embed`	`embed-english-v3.0`	1024
`create_ollama_embed`	`nomic-embed-text`	768
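
For tests or offline development it can help to avoid a provider call altogether. The sketch below is a toy embedder that matches the `Callable[[str], Awaitable[list[float]]]` shape listed for `embed_fn`. The name `toy_embed` and the 64-dimension size are assumptions made for illustration. The vectors are deterministic but carry no real semantic meaning, so this is not a substitute for the helpers above.

```python
import asyncio
import hashlib
import math

DIMENSIONS = 64  # arbitrary for this sketch; real models use the sizes in the table

async def toy_embed(text: str) -> list[float]:
    """Hash each lowercase word into one of DIMENSIONS buckets, then L2-normalize."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIMENSIONS
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return vector  # empty input: return the zero vector unchanged
    return [v / norm for v in vector]

embedding = asyncio.run(toy_embed("What is the capital of France?"))
```

Whatever embedder is used, every process sharing an index should presumably use the same model, since vectors from different models or dimensions are not comparable.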

LLM-as-judge

When a hit lands in the uncertainty band (threshold - uncertainty_band < score <= threshold, where score is the cosine distance), supply a judge_fn to adjudicate it automatically:

from betterdb_semantic_cache.types import CacheCheckOptions, JudgeOptions

result = await cache.check(user_prompt, CacheCheckOptions(
    judge=JudgeOptions(
        judge_fn=my_judge,
        on_error="accept",   # fail-open on judge errors (default)
        timeout_ms=2000,     # per-call timeout (default)
    )
))

A minimal OpenAI judge:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def my_judge(ctx: dict) -> bool:
    # ctx keys: prompt, response, similarity, threshold, category
    # Return True to accept (confidence → 'high')
    # Return False to reject (treated as miss)
    resp = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply YES or NO only."},
            {"role": "user", "content": (
                f"Does this cached response correctly answer the prompt?\n"
                f"Prompt: {ctx['prompt']}\nResponse: {ctx['response']}"
            )},
        ],
    )
    # Normalize whitespace and case so replies such as " yes." are still accepted
    return (resp.choices[0].message.content or "").strip().upper().startswith("YES")

The judge is only invoked for uncertain hits. High-confidence hits, misses, and no-candidate cases bypass it entirely. When both rerank and judge are set, the judge runs on the reranked pick.
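
The decision flow described above can be sketched in plain Python. This is not the package's implementation: `classify` and `adjudicate` are hypothetical names, and the sketch assumes that any `on_error` value other than "accept" means reject. It shows the band arithmetic, the judge bypass for high-confidence hits and misses, and fail-open behaviour on a timeout.

```python
import asyncio
from typing import Awaitable, Callable

def classify(distance: float, threshold: float, uncertainty_band: float) -> str:
    """Map a cosine distance onto 'high', 'uncertain', or 'miss'."""
    if distance > threshold:
        return "miss"
    if distance > threshold - uncertainty_band:
        return "uncertain"  # threshold - band < distance <= threshold
    return "high"

async def adjudicate(
    distance: float,
    threshold: float,
    uncertainty_band: float,
    judge_fn: Callable[[dict], Awaitable[bool]],
    ctx: dict,
    timeout_ms: int = 2000,
    on_error: str = "accept",
) -> bool:
    """Return True when the candidate should be served as a hit."""
    band = classify(distance, threshold, uncertainty_band)
    if band == "miss":
        return False
    if band == "high":
        return True  # the judge is bypassed entirely
    try:
        return await asyncio.wait_for(judge_fn(ctx), timeout=timeout_ms / 1000)
    except Exception:
        # Timeouts and judge failures both land here; "accept" fails open.
        return on_error == "accept"

async def rejecting_judge(ctx: dict) -> bool:
    return False

async def slow_judge(ctx: dict) -> bool:
    await asyncio.sleep(0.5)  # longer than the timeout used below
    return False

async def demo() -> dict[str, bool]:
    ctx = {"prompt": "Capital city of France?", "response": "Paris"}
    return {
        "high_confidence": await adjudicate(0.02, 0.1, 0.05, rejecting_judge, ctx),
        "uncertain_rejected": await adjudicate(0.08, 0.1, 0.05, rejecting_judge, ctx),
        "uncertain_timeout": await adjudicate(0.08, 0.1, 0.05, slow_judge, ctx, timeout_ms=20),
        "miss": await adjudicate(0.30, 0.1, 0.05, rejecting_judge, ctx),
    }

outcomes = asyncio.run(demo())
```

Note that the high-confidence case is served even though the judge would have rejected it, because the judge is never consulted outside the band.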

Rerank hook

Retrieve the top-k candidates and select the best with a custom function. Each candidate dict carries response (str), similarity (float, cosine distance), and prompt (str, the stored prompt text for that entry):

from betterdb_semantic_cache.types import CacheCheckOptions, RerankOptions

async def my_rerank(query: str, candidates: list[dict]) -> int:
    # Return index of best candidate, or -1 to reject all
    return 0

result = await cache.check(prompt, CacheCheckOptions(
    rerank=RerankOptions(k=5, rerank_fn=my_rerank),
))

Built-in keyword-overlap reranker

The package includes a built-in reranker that blends cosine similarity with word overlap. It catches entity mismatches that cosine similarity alone misses (e.g. “weather in Paris” vs “weather in Berlin”):

from betterdb_semantic_cache import create_keyword_overlap_rerank
from betterdb_semantic_cache.types import CacheCheckOptions, RerankOptions

result = await cache.check(prompt, CacheCheckOptions(
    rerank=RerankOptions(
        k=3,
        rerank_fn=create_keyword_overlap_rerank(compare="prompt"),  # default
    ),
))

compare="prompt" (the default) is the equivalence signal: it measures word overlap between the incoming query and each candidate’s stored prompt. compare="response" is the relevance signal: it measures word overlap between the query and the cached response.

cosine_weight (default 0.7) controls the blend: cosine similarity gets this weight, word overlap gets 1 - cosine_weight. Treat the reranker as a cheap pre-gate; it is not expected to improve quality on adversarial-paraphrase inputs.
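
To make the blend concrete, here is a minimal sketch of a reranker with the same shape as `my_rerank`. It is not the built-in implementation. It assumes cosine similarity is taken as 1 minus the candidate's cosine distance and that word overlap is Jaccard overlap of lowercase word sets, and the names `word_overlap` and `make_overlap_rerank` are hypothetical.

```python
import asyncio
import re

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def make_overlap_rerank(compare: str = "prompt", cosine_weight: float = 0.7):
    """Build an async rerank function that blends cosine similarity and overlap."""
    async def rerank(query: str, candidates: list[dict]) -> int:
        if not candidates:
            return -1  # nothing to choose from: reject all

        def score(candidate: dict) -> float:
            # Assumption: "similarity" holds a cosine distance, as in the text above.
            cosine_similarity = 1.0 - candidate["similarity"]
            overlap = word_overlap(query, candidate[compare])
            return cosine_weight * cosine_similarity + (1.0 - cosine_weight) * overlap

        return max(range(len(candidates)), key=lambda i: score(candidates[i]))
    return rerank

candidates = [
    {"prompt": "weather in Paris", "response": "Sunny in Paris.", "similarity": 0.05},
    {"prompt": "Berlin weather", "response": "Rain in Berlin.", "similarity": 0.07},
]
best = asyncio.run(make_overlap_rerank()("weather in Berlin", candidates))
cosine_only = asyncio.run(make_overlap_rerank(cosine_weight=1.0)("weather in Berlin", candidates))
```

In this example the Paris entry is closer by cosine distance alone, but the Berlin entry wins once word overlap is blended in, which is the entity-mismatch case described above.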

Cost tracking

Store token counts alongside responses to enable cost savings reporting:

from betterdb_semantic_cache.types import CacheStoreOptions

await cache.store("What is the capital of France?", "Paris", CacheStoreOptions(
    model="gpt-4o",
    input_tokens=25,
    output_tokens=5,
))

result = await cache.check("Capital of France?")
# result.cost_saved == 0.000105 on hit

stats = await cache.stats()
# stats.cost_saved_micros == 105 (microdollars)

Cost is computed using the bundled LiteLLM price table. Override or extend with the cost_table option.
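
The arithmetic behind a saved-cost figure can be sketched as follows. The `TokenPrice` type and its field names are hypothetical (they are not the package's `ModelCost`), and the prices are made-up values chosen only so that the result matches the 105 microdollar figure in the example above. They are not gpt-4o's actual prices or values from the bundled table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical per-token prices, in microdollars, for one model."""
    input_micros: int
    output_micros: int

def cost_saved_micros(price: TokenPrice, input_tokens: int, output_tokens: int) -> int:
    """Cost of the call that a cache hit avoided, in microdollars."""
    return price.input_micros * input_tokens + price.output_micros * output_tokens

# Made-up prices for illustration only.
price = TokenPrice(input_micros=3, output_micros=6)

micros = cost_saved_micros(price, input_tokens=25, output_tokens=5)
dollars = micros / 1_000_000  # convert microdollars to dollars
```

Keeping the running total in integer microdollars, as `stats.cost_saved_micros` suggests, avoids accumulating floating-point error across many hits.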

Threshold effectiveness

Analyze the rolling similarity score window for threshold tuning guidance:

analysis = await cache.threshold_effectiveness(min_samples=100)
# analysis.recommendation: 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data'
# analysis.recommended_threshold: 0.085 (present when actionable)
# analysis.reasoning: 'Human-readable explanation'

When BetterDB Monitor is connected, this data feeds into the Monitor’s self-tuning loop — the Monitor reads the similarity window, generates proposals with reasoning, and writes approved threshold changes back to Valkey. The SDK picks them up via config_refresh.

Batch check

Pipeline multiple lookups in a single round-trip:

results = await cache.check_batch([
    "What is the capital of France?",
    "Who wrote Hamlet?",
    "What is 2 + 2?",
])
# results[0].hit == True, etc.

check_batch() does not support judge. Call check() individually for prompts that need adjudication.
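
One way to combine the two paths is to batch the prompts that do not need a judge and check the rest individually, then merge the results back into input order. The helper below is a sketch under assumptions: `check_mixed` and `needs_judge` are hypothetical names, and the fake functions stand in for `cache.check_batch` and a judged `cache.check` call so the example runs without Valkey. It also assumes batch results come back in input order.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence

async def check_mixed(
    prompts: Sequence[str],
    needs_judge: Callable[[str], bool],
    batch_check: Callable[[list[str]], Awaitable[list[Any]]],
    judged_check: Callable[[str], Awaitable[Any]],
) -> list[Any]:
    """Batch what can be batched, check the rest one by one, keep input order."""
    results: list[Any] = [None] * len(prompts)
    batch_idx = [i for i, p in enumerate(prompts) if not needs_judge(p)]
    judged_idx = [i for i, p in enumerate(prompts) if needs_judge(p)]

    if batch_idx:
        batch_results = await batch_check([prompts[i] for i in batch_idx])
        for i, result in zip(batch_idx, batch_results):
            results[i] = result

    if judged_idx:
        judged_results = await asyncio.gather(
            *(judged_check(prompts[i]) for i in judged_idx)
        )
        for i, result in zip(judged_idx, judged_results):
            results[i] = result

    return results

# Stand-ins so the sketch runs without a cache instance.
async def fake_batch(prompts: list[str]) -> list[str]:
    return [f"batch:{p}" for p in prompts]

async def fake_judged(prompt: str) -> str:
    return f"judged:{prompt}"

merged = asyncio.run(check_mixed(
    ["What is the capital of France?", "Is this dose safe?", "What is 2 + 2?"],
    needs_judge=lambda p: "dose" in p,
    batch_check=fake_batch,
    judged_check=fake_judged,
))
```

With a real cache, `batch_check` would be `cache.check_batch` and `judged_check` a small wrapper that calls `cache.check` with `CacheCheckOptions(judge=...)`.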

Config refresh and discovery

Config refresh (enabled by default): every 30 seconds the cache re-reads {name}:__config from Valkey and updates the in-memory threshold. When BetterDB Monitor approves a threshold proposal, the running cache picks it up without a restart.

Discovery (enabled by default): on initialize() the cache registers itself in the __betterdb:caches hash and writes a periodic heartbeat. BetterDB Monitor uses this to enumerate live caches for health monitoring and threshold recommendations.

from betterdb_semantic_cache.types import ConfigRefreshOptions, DiscoveryOptions

cache = SemanticCache(SemanticCacheOptions(
    ...,
    config_refresh=ConfigRefreshOptions(enabled=True, interval_ms=30_000),
    discovery=DiscoveryOptions(enabled=True, heartbeat_interval_ms=30_000),
))
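
Conceptually, config refresh is a background loop that periodically re-reads a value and swaps it into memory. The sketch below illustrates that pattern with asyncio. It is not the package's implementation: `refresh_loop`, the `state` dict, and the fake fetch function are assumptions, and a real loop would read the {name}:__config hash from Valkey instead.

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def refresh_loop(
    fetch_threshold: Callable[[], Awaitable[Optional[float]]],
    state: dict,
    interval_s: float,
    stop: asyncio.Event,
) -> None:
    """Re-read the threshold every interval_s seconds until stop is set."""
    while not stop.is_set():
        try:
            value = await fetch_threshold()
            if value is not None:
                state["threshold"] = value
        except Exception:
            pass  # keep the last known threshold if the read fails
        try:
            # Sleep for one interval, but wake immediately when stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass

async def demo() -> dict:
    stop = asyncio.Event()
    state = {"threshold": 0.1}
    pending = [0.1, 0.085]  # values successive reads would return

    async def fake_fetch() -> Optional[float]:
        if pending:
            return pending.pop(0)
        stop.set()  # nothing left to read: end the demo
        return None

    await refresh_loop(fake_fetch, state, interval_s=0.01, stop=stop)
    return state

final_state = asyncio.run(demo())
```

Swallowing read errors keeps the cache serving with its last known threshold, which seems the safer default for a background task of this kind.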

Telemetry

The published wheel includes anonymous product analytics powered by PostHog. When enabled, aggregate usage statistics (hit rate, cost saved) are collected on a per-instance basis — no prompt text, responses, or personally-identifiable information is ever sent.

To opt out:

export BETTERDB_TELEMETRY=false   # also accepts: 0, no, off

Or programmatically:

from betterdb_semantic_cache.types import AnalyticsOptions

cache = SemanticCache(SemanticCacheOptions(
    ...,
    analytics=AnalyticsOptions(disabled=True),
))
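
If your own tooling needs to know whether the opt-out is in effect (for example, to assert it in CI), the accepted values can be mirrored in a small helper. This is a sketch: `telemetry_opted_out` is a hypothetical name, and case-insensitive matching is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping, Optional

# Values listed above as disabling analytics.
_OPT_OUT_VALUES = {"false", "0", "no", "off"}

def telemetry_opted_out(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when BETTERDB_TELEMETRY is set to one of the opt-out values."""
    env = os.environ if env is None else env
    value = env.get("BETTERDB_TELEMETRY", "")
    return value.strip().lower() in _OPT_OUT_VALUES

opted_out = telemetry_opted_out({"BETTERDB_TELEMETRY": "false"})
default_case = telemetry_opted_out({})  # variable unset
```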

Interoperability with the TypeScript package

The Python and TypeScript packages use the same Valkey data format: same index schema, same __config hash, same __similarity_window sorted set, same __stats hash. A cache created by one can be read and written by the other. BetterDB Monitor treats them identically.

This means you can:

Store responses from a Python service and serve them from a TypeScript edge function
Run the TypeScript package in production and the Python package in your benchmark harness
Use either language for batch migration or data inspection tools
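
For inspection tools, it can be handy to derive the shared key names from the cache name. The helper below is a sketch. Only {name}:__config and __betterdb:caches are spelled out in this document, so the assumption that the similarity window and stats keys use the same `{name}:` prefix should be verified against a live instance. `shared_keys` is a hypothetical name.

```python
def shared_keys(name: str = "betterdb_scache") -> dict[str, str]:
    """Key names both packages are described as sharing, for a given cache name."""
    return {
        "config": f"{name}:__config",
        # Assumed to follow the same prefix pattern as the config hash.
        "similarity_window": f"{name}:__similarity_window",
        "stats": f"{name}:__stats",
        # Discovery registry, shared across all caches.
        "discovery": "__betterdb:caches",
    }

keys = shared_keys("support_bot")
```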