Semantic Cache (Python)

betterdb-semantic-cache is the Python counterpart to @betterdb/semantic-cache. Same architecture, same Valkey data format, same Monitor integration — different language. A TypeScript app and a Python app can share the same cache index.

v0.4.0 ships with full feature parity: LLM-as-judge, reranking, embedding caching, cost tracking, config refresh, discovery, multi-modal prompts, batch lookups, and all framework adapters.

Prerequisites

  • Valkey 8.0+ with the valkey-search module loaded
  • Or Amazon ElastiCache for Valkey (8.0+)
  • Or Google Cloud Memorystore for Valkey
  • Python >= 3.11

Installation

pip install betterdb-semantic-cache
# With OpenAI embeddings:
pip install betterdb-semantic-cache[openai]
# All extras:
pip install betterdb-semantic-cache[all]

Quick start

import asyncio
import valkey.asyncio as valkey
from betterdb_semantic_cache import SemanticCache, SemanticCacheOptions
from betterdb_semantic_cache.embed.openai import create_openai_embed

async def main():
    client = valkey.Valkey(host="localhost", port=6399)
    cache = SemanticCache(SemanticCacheOptions(
        client=client,
        embed_fn=create_openai_embed(),  # text-embedding-3-small by default
        default_threshold=0.12,
    ))
    await cache.initialize()

    await cache.store("What is the capital of France?", "Paris")

    result = await cache.check("Capital city of France?")
    # result.hit == True
    # result.response == "Paris"
    # result.cost_saved == 0.000105 (based on bundled LiteLLM prices)

asyncio.run(main())

Configuration reference

Option Type Default Description
name str 'betterdb_scache' Index name prefix for all Valkey keys
client valkey.asyncio.Valkey required A valkey-py async client instance
embed_fn Callable[[str], Awaitable[list[float]]] required Async embedding function
default_threshold float 0.1 Cosine distance threshold (0–2)
default_ttl int \| None None Default TTL in seconds
category_thresholds dict[str, float] {} Per-category threshold overrides
uncertainty_band float 0.05 Width of the uncertainty band below threshold
cost_table dict[str, ModelCost] {} Custom model pricing overrides
use_default_cost_table bool True Merge bundled LiteLLM price table
embedding_cache.enabled bool True Cache computed embeddings in Valkey
embedding_cache.ttl int 86400 Embedding cache TTL in seconds
telemetry.tracer_name str 'betterdb-semantic-cache' OTel tracer name
telemetry.metrics_prefix str 'semantic_cache' Prometheus metric name prefix
config_refresh.enabled bool True Periodically re-read threshold config from Valkey
config_refresh.interval_ms int 30000 Refresh interval in milliseconds (min: 1000)
discovery.enabled bool True Register cache in Valkey for BetterDB Monitor
discovery.heartbeat_interval_ms int 30000 Heartbeat interval in milliseconds

Adapters

All adapters are submodule imports with optional peer dependencies.

LangChain

from betterdb_semantic_cache.adapters.langchain import BetterDBSemanticCache
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(cache=BetterDBSemanticCache(cache=cache))

OpenAI Chat Completions

from betterdb_semantic_cache.adapters.openai import prepare_semantic_params

text, model = prepare_semantic_params(params)
result = await cache.check(text)

OpenAI Responses API

from betterdb_semantic_cache.adapters.openai_responses import prepare_semantic_params

text = prepare_semantic_params(params)

Anthropic Messages

from betterdb_semantic_cache.adapters.anthropic import prepare_semantic_params

text = prepare_semantic_params(params)

LlamaIndex

from betterdb_semantic_cache.adapters.llamaindex import prepare_semantic_params

text = prepare_semantic_params(messages, model="gpt-4o")

LangGraph (semantic memory store)

from betterdb_semantic_cache.adapters.langgraph import BetterDBSemanticStore

store = BetterDBSemanticStore(cache=cache)
await store.put(("user", "alice", "memories"), "mem1", {"content": "Alice lives in Paris."})
results = await store.search(("user", "alice", "memories"), query="Where does Alice live?")

Embedding helpers

Pre-built EmbedFn factories for common providers:

from betterdb_semantic_cache.embed.openai import create_openai_embed
from betterdb_semantic_cache.embed.bedrock import create_bedrock_embed
from betterdb_semantic_cache.embed.voyage import create_voyage_embed
from betterdb_semantic_cache.embed.cohere import create_cohere_embed
from betterdb_semantic_cache.embed.ollama import create_ollama_embed
Helper Model default Dimensions
create_openai_embed text-embedding-3-small 1536
create_bedrock_embed amazon.titan-embed-text-v2:0 1024
create_voyage_embed voyage-3-lite 512
create_cohere_embed embed-english-v3.0 1024
create_ollama_embed nomic-embed-text 768

LLM-as-judge

When a hit lands in the uncertainty band (threshold - uncertainty_band < score <= threshold), supply a judge_fn to adjudicate automatically:

from betterdb_semantic_cache.types import CacheCheckOptions, JudgeOptions

result = await cache.check(user_prompt, CacheCheckOptions(
    judge=JudgeOptions(
        judge_fn=my_judge,
        on_error="accept",   # fail-open on judge errors (default)
        timeout_ms=2000,     # per-call timeout (default)
    )
))

A minimal OpenAI judge:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def my_judge(ctx: dict) -> bool:
    # ctx keys: prompt, response, similarity, threshold, category
    # Return True to accept (confidence → 'high')
    # Return False to reject (treated as miss)
    resp = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply YES or NO only."},
            {"role": "user", "content": (
                f"Does this cached response correctly answer the prompt?\n"
                f"Prompt: {ctx['prompt']}\nResponse: {ctx['response']}"
            )},
        ],
    )
    return (resp.choices[0].message.content or "").startswith("YES")

The judge is only invoked for uncertain hits. High-confidence hits, misses, and no-candidate cases bypass it entirely. When both rerank and judge are set, the judge runs on the reranked pick.

Rerank hook

Retrieve top-k candidates and select the best with a custom function:

from betterdb_semantic_cache.types import CacheCheckOptions, RerankOptions

async def my_rerank(query: str, candidates: list[dict]) -> int:
    # Return index of best candidate, or -1 to reject all
    return 0

result = await cache.check(prompt, CacheCheckOptions(
    rerank=RerankOptions(k=5, rerank_fn=my_rerank),
))

Cost tracking

Store token counts alongside responses to enable cost savings reporting:

from betterdb_semantic_cache.types import CacheStoreOptions

await cache.store("What is the capital of France?", "Paris", CacheStoreOptions(
    model="gpt-4o",
    input_tokens=25,
    output_tokens=5,
))

result = await cache.check("Capital of France?")
# result.cost_saved == 0.000105 on hit

stats = await cache.stats()
# stats.cost_saved_micros == 105 (microdollars)

Cost is computed using the bundled LiteLLM price table. Override or extend with the cost_table option.

Threshold effectiveness

Analyze the rolling similarity score window for threshold tuning guidance:

analysis = await cache.threshold_effectiveness(min_samples=100)
# analysis.recommendation: 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data'
# analysis.recommended_threshold: 0.085 (present when actionable)
# analysis.reasoning: 'Human-readable explanation'

When BetterDB Monitor is connected, this data feeds into the Monitor’s self-tuning loop — the Monitor reads the similarity window, generates proposals with reasoning, and writes approved threshold changes back to Valkey. The SDK picks them up via config_refresh.

Batch check

Pipeline multiple lookups in a single round-trip:

results = await cache.check_batch([
    "What is the capital of France?",
    "Who wrote Hamlet?",
    "What is 2 + 2?",
])
# results[0].hit == True, etc.

check_batch() does not support judge. Call check() individually for prompts that need adjudication.

Config refresh and discovery

Config refresh (enabled by default): every 30 seconds the cache re-reads {name}:__config from Valkey and updates the in-memory threshold. When BetterDB Monitor approves a threshold proposal, the running cache picks it up without a restart.

Discovery (enabled by default): on initialize() the cache registers itself in the __betterdb:caches hash and writes a periodic heartbeat. BetterDB Monitor uses this to enumerate live caches for health monitoring and threshold recommendations.

from betterdb_semantic_cache.types import ConfigRefreshOptions, DiscoveryOptions

cache = SemanticCache(SemanticCacheOptions(
    ...,
    config_refresh=ConfigRefreshOptions(enabled=True, interval_ms=30_000),
    discovery=DiscoveryOptions(enabled=True, heartbeat_interval_ms=30_000),
))

Telemetry

The published wheel includes anonymous product analytics powered by PostHog. When enabled, aggregate usage statistics (hit rate, cost saved) are collected on a per-instance basis — no prompt text, responses, or personally-identifiable information is ever sent.

To opt out:

export BETTERDB_TELEMETRY=false   # also accepts: 0, no, off

Or programmatically:

from betterdb_semantic_cache.types import AnalyticsOptions

cache = SemanticCache(SemanticCacheOptions(
    ...,
    analytics=AnalyticsOptions(disabled=True),
))

Interoperability with the TypeScript package

The Python and TypeScript packages use the same Valkey data format: same index schema, same __config hash, same __similarity_window sorted set, same __stats hash. A cache created by one can be read and written by the other. BetterDB Monitor treats them identically.

This means you can:

  • Store responses from a Python service and serve them from a TypeScript edge function
  • Run the TypeScript package in production and the Python package in your benchmark harness
  • Use either language for batch migration or data inspection tools