Semantic Cache
@betterdb/semantic-cache is a standalone, framework-agnostic semantic cache library for LLM applications backed by Valkey. It uses the valkey-search module’s vector similarity search to match incoming prompts against previously cached responses, returning a hit when the cosine distance is at or below a configurable threshold.
v0.2.0 adds full adapter parity with agent-cache: OpenAI, Anthropic, LlamaIndex, LangGraph, multi-modal prompt support, cost tracking, threshold effectiveness recommendations, embedding caching, batch lookups, and more.
Prerequisites
- Valkey 8.0+ with the valkey-search module loaded
- Or Amazon ElastiCache for Valkey (8.0+)
- Or Google Cloud Memorystore for Valkey
- Node.js >= 20
Installation
npm install @betterdb/semantic-cache iovalkey
iovalkey is a peer dependency - you must install it alongside the package.
Quick start
import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';

const client = new Valkey({ host: 'localhost', port: 6399 });
const cache = new SemanticCache({
  client,
  embedFn: createOpenAIEmbed(), // text-embedding-3-small by default
  defaultThreshold: 0.1,
  defaultTtl: 3600,
});
await cache.initialize();

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 20,
  outputTokens: 5,
});

const result = await cache.check('Capital city of France?');
// result.hit === true
// result.response === 'Paris'
// result.costSaved === 0.000105 (based on bundled LiteLLM prices)
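In an application, check() and store() usually wrap a model call in a cache-aside pattern. The following is a minimal sketch that assumes only the call shapes shown in the quick start; `CacheLike`, `cachedCompletion`, and `callModel` are hypothetical names for illustration and are not part of the library.

```typescript
// Minimal structural type covering only the two calls used here.
// This is an assumption for the sketch, not the library's full interface.
interface CacheLike {
  check(prompt: string): Promise<{ hit: boolean; response?: string }>;
  store(prompt: string, response: string, meta?: { model?: string }): Promise<unknown>;
}

// Hypothetical helper: return the cached response on a hit, otherwise call
// the model and store its answer so later paraphrases can hit.
async function cachedCompletion(
  cache: CacheLike,
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
  model = 'gpt-4o',
): Promise<{ response: string; fromCache: boolean }> {
  const result = await cache.check(prompt);
  if (result.hit && result.response !== undefined) {
    return { response: result.response, fromCache: true };
  }
  const response = await callModel(prompt);
  await cache.store(prompt, response, { model });
  return { response, fromCache: false };
}
```

Passing the model name to store() keeps the entry usable with model-based features such as cost tracking; whether a SemanticCache instance satisfies `CacheLike` exactly depends on the library's published types.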
Configuration reference
| Option | Type | Default | Description |
|---|---|---|---|
| name | string | 'betterdb_scache' | Index name prefix for all Valkey keys |
| client | Valkey | required | An iovalkey client instance |
| embedFn | (text: string) => Promise<number[]> | required | Embedding function |
| defaultThreshold | number | 0.1 | Cosine distance threshold (0-2) |
| defaultTtl | number | undefined | Default TTL in seconds |
| categoryThresholds | Record<string, number> | {} | Per-category threshold overrides |
| uncertaintyBand | number | 0.05 | Width of the uncertainty band below threshold |
| costTable | Record<string, ModelCost> | undefined | Custom model pricing overrides |
| useDefaultCostTable | boolean | true | Merge bundled LiteLLM price table |
| normalizer | BinaryNormalizer | defaultNormalizer | Binary content normalizer for multi-modal prompts |
| embeddingCache.enabled | boolean | true | Cache computed embeddings in Valkey |
| embeddingCache.ttl | number | 86400 | Embedding cache TTL in seconds |
| telemetry.tracerName | string | '@betterdb/semantic-cache' | OTel tracer name |
| telemetry.metricsPrefix | string | 'semantic_cache' | Prometheus metric name prefix |
| telemetry.registry | Registry | prom-client default | Custom prom-client Registry |
Threshold and confidence
@betterdb/semantic-cache uses cosine distance (0-2 scale, lower = more similar):
| Distance | Meaning |
|---|---|
| 0.00 | Identical vectors |
| 0.05-0.10 | Strong paraphrase |
| 0.10-0.20 | Loose paraphrase / related topic |
| 1.00 | Orthogonal (unrelated) |
A lookup is a hit when score <= threshold. The default threshold is 0.1. Set it per cache via defaultThreshold, per category via categoryThresholds, or per request via the threshold check option.
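To make the 0-2 scale concrete, here is a small sketch of cosine distance computed as 1 minus cosine similarity, together with the hit rule stated above. The library delegates the distance computation to valkey-search; `cosineDistance` and `isHit` are illustrative names only.

```typescript
// Cosine distance = 1 - cosine similarity, so it ranges from 0 (same
// direction) through 1 (orthogonal) to 2 (opposite direction).
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('cosine distance is undefined for zero vectors');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hit rule from the text: a lookup is a hit when score <= threshold.
function isHit(score: number, threshold = 0.1): boolean {
  return score <= threshold;
}
```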
Confidence levels
Every hit is graded against the uncertaintyBand (default 0.05), the width of the band immediately below the threshold:
| confidence | When | What to do |
|---|---|---|
| high | score <= threshold - uncertaintyBand (e.g. <= 0.05) | Return the cached response directly |
| uncertain | threshold - uncertaintyBand < score <= threshold (e.g. 0.05–0.10) | Return the response but consider flagging for review, or hand it to the LLM-as-judge |
| miss | score > threshold | No hit - call the LLM |
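The grading rules in the table can be written as a small pure function. This is a sketch of the rules as stated, with the documented defaults; `classify` is an illustrative name, not a library export.

```typescript
type Confidence = 'high' | 'uncertain' | 'miss';

// Grade a cosine-distance score against the threshold and the uncertainty
// band that sits immediately below it.
function classify(score: number, threshold = 0.1, uncertaintyBand = 0.05): Confidence {
  if (score > threshold) return 'miss';
  if (score <= threshold - uncertaintyBand) return 'high';
  return 'uncertain';
}
```

With the defaults, a score of 0.08 grades as uncertain; raising the threshold to 0.15 moves the same score into the high band, which is why looser thresholds promote paraphrases to high confidence.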
Recommended thresholds by use case:
| Use case | Threshold | Notes |
|---|---|---|
| FAQ / exact match only | 0.05 | Very strict, near-zero false positives |
| Standard Q&A | 0.10 | Default - paraphrases land as uncertain |
| Conversational / RAG | 0.15 | Paraphrases hit as high confidence |
| Broad search / recall | 0.20 | High hit rate, review uncertain hits |
Adapters
All adapters are subpath exports with optional peer dependencies.
LangChain
import { ChatOpenAI } from '@langchain/openai';
import { BetterDBSemanticCache } from '@betterdb/semantic-cache/langchain';
const llm = new ChatOpenAI({ cache: new BetterDBSemanticCache({ cache }) });
Vercel AI SDK
import { wrapLanguageModel } from 'ai';
import { openai } from '@ai-sdk/openai';
import { createSemanticCacheMiddleware } from '@betterdb/semantic-cache/ai';
const model = wrapLanguageModel({ model: openai('gpt-4o'), middleware: createSemanticCacheMiddleware({ cache }) });
OpenAI Chat Completions
import { prepareSemanticParams } from '@betterdb/semantic-cache/openai';
const { text, model } = await prepareSemanticParams(params);
const result = await cache.check(text);
OpenAI Responses API
import { prepareSemanticParams } from '@betterdb/semantic-cache/openai-responses';
const { text } = await prepareSemanticParams(params);
Anthropic Messages
import { prepareSemanticParams } from '@betterdb/semantic-cache/anthropic';
const { text } = await prepareSemanticParams(params);
LlamaIndex
import { prepareSemanticParams } from '@betterdb/semantic-cache/llamaindex';
const { text } = await prepareSemanticParams(messages, { model: 'gpt-4o' });
LangGraph (semantic memory store)
import { BetterDBSemanticStore } from '@betterdb/semantic-cache/langgraph';
const store = new BetterDBSemanticStore({ cache });
await store.put(['user', 'alice', 'memories'], 'mem1', { content: 'Alice lives in Paris.' });
const results = await store.search(['user', 'alice', 'memories'], { query: 'Where does Alice live?' });
Use BetterDBSemanticStore for similarity-based memory retrieval. For exact-match checkpoint persistence, use @betterdb/agent-cache/langgraph.
Embedding helpers
Pre-built EmbedFn factories for common providers:
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';
import { createBedrockEmbed } from '@betterdb/semantic-cache/embed/bedrock';
import { createVoyageEmbed } from '@betterdb/semantic-cache/embed/voyage';
import { createCohereEmbed } from '@betterdb/semantic-cache/embed/cohere';
import { createOllamaEmbed } from '@betterdb/semantic-cache/embed/ollama';
| Helper | Model default | Dimensions |
|---|---|---|
| createOpenAIEmbed | text-embedding-3-small | 1536 |
| createBedrockEmbed | amazon.titan-embed-text-v2:0 | 1024 |
| createVoyageEmbed | voyage-3-lite | 512 |
| createCohereEmbed | embed-english-v3.0 | 1024 |
| createOllamaEmbed | nomic-embed-text | 768 |
Cost tracking
Store token counts alongside responses to enable cost savings reporting:
await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 25,
  outputTokens: 5,
});

const result = await cache.check('Capital of France?');
// result.costSaved === 0.000105 on hit

const stats = await cache.stats();
// stats.costSavedMicros === 105 (microdollars)
Cost is computed using the bundled LiteLLM price table (1,971 models). Override or extend it with the costTable option.
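The arithmetic behind a per-hit saving is presumably token counts multiplied by per-token prices. The sketch below illustrates that calculation under stated assumptions: the `ModelCostSketch` shape and its field names are hypothetical (the library's ModelCost type may differ), and the prices are made up for the example rather than taken from the bundled table.

```typescript
// Hypothetical price shape for illustration only.
interface ModelCostSketch {
  inputCostPerToken: number; // dollars per input token
  outputCostPerToken: number; // dollars per output token
}

// Dollars that a cache hit avoids spending on the original call.
function estimateCost(price: ModelCostSketch, inputTokens: number, outputTokens: number): number {
  return inputTokens * price.inputCostPerToken + outputTokens * price.outputCostPerToken;
}

// Convert dollars to whole microdollars, the unit used by costSavedMicros.
function toMicros(dollars: number): number {
  return Math.round(dollars * 1e6);
}

// Illustrative prices only, not real model pricing.
const examplePrice: ModelCostSketch = { inputCostPerToken: 0.000002, outputCostPerToken: 0.00001 };
const exampleSaving = estimateCost(examplePrice, 25, 5);
```

Tracking the running total in integer microdollars avoids accumulating floating-point error across many small per-hit amounts.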
Multi-modal prompts
Use ContentBlock[] to cache prompts with binary content:
import { hashBase64, type ContentBlock } from '@betterdb/semantic-cache';

const prompt: ContentBlock[] = [
  { type: 'text', text: 'Describe this image.' },
  { type: 'binary', kind: 'image', mediaType: 'image/png', ref: hashBase64(imageBase64) },
];
await cache.store(prompt, 'A red square.');
const result = await cache.check(prompt); // hit only if text AND image match
Use storeMultipart() to store structured response blocks:
const blocks: ContentBlock[] = [
  { type: 'text', text: 'The answer is 42.' },
  { type: 'reasoning', text: 'By my calculation...' },
];
await cache.storeMultipart(prompt, blocks);
const result = await cache.check(prompt);
// result.contentBlocks === blocks
Threshold effectiveness recommendations
Analyze the rolling similarity score window for threshold tuning guidance:
const analysis = await cache.thresholdEffectiveness({ minSamples: 100 });
// analysis.recommendation: 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data'
// analysis.recommendedThreshold: 0.085 (present when recommendation is not optimal/insufficient)
// analysis.reasoning: 'Human-readable explanation'
// Per-category analysis
const allCategories = await cache.thresholdEffectivenessAll();
Batch check
Pipeline multiple lookups in a single round-trip:
const results = await cache.checkBatch([
  'What is the capital of France?',
  'Who wrote Hamlet?',
  'What is 2 + 2?',
]);
// results[0].hit === true, etc.
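After a batch lookup, a caller typically separates the prompts that were answered from those that still need a model call. The sketch below assumes results come back in the same order as the input prompts, as the `results[0]` comment suggests; `BatchCacheLike` and `splitHitsAndMisses` are hypothetical names.

```typescript
// Minimal structural type for the one call used here (an assumption).
interface BatchCacheLike {
  checkBatch(prompts: string[]): Promise<Array<{ hit: boolean; response?: string }>>;
}

// Hypothetical helper: map answered prompts to their cached responses and
// collect the prompts that missed so they can be sent to the model.
async function splitHitsAndMisses(
  cache: BatchCacheLike,
  prompts: string[],
): Promise<{ answered: Map<string, string>; misses: string[] }> {
  const results = await cache.checkBatch(prompts);
  const answered = new Map<string, string>();
  const misses: string[] = [];
  prompts.forEach((prompt, i) => {
    const r = results[i];
    if (r && r.hit && r.response !== undefined) {
      answered.set(prompt, r.response);
    } else {
      misses.push(prompt);
    }
  });
  return { answered, misses };
}
```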
Stale model eviction
Automatically evict cache entries when the model changes:
const result = await cache.check('What is 2+2?', {
  staleAfterModelChange: true,
  currentModel: 'gpt-4o',
});
// If the cached entry was stored with model='gpt-3.5-turbo', it's evicted and treated as miss
LLM-as-judge
When a hit lands in the uncertainty band (threshold - uncertaintyBand < score <= threshold), supply a judgeFn to adjudicate the borderline hit automatically instead of handling confidence: 'uncertain' yourself. This adds a second step on top of the similarity check: check() first classifies the hit, and only uncertain hits are passed to the judge, which accepts (promotes to high) or rejects (treats as a miss).
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const result = await cache.check(userPrompt, {
  judge: {
    judgeFn: async ({ prompt, response, similarity, threshold, category }) => {
      // Return true to accept (confidence → 'high')
      // Return false to reject (treated as a miss with nearestMiss)
      const verdict = await openai.chat.completions.create({
        model: 'gpt-5-mini',
        messages: [
          { role: 'system', content: 'Reply YES or NO only.' },
          { role: 'user', content: `Does this cached response correctly answer the prompt?\nPrompt: ${prompt}\nResponse: ${response}` },
        ],
      });
      return verdict.choices[0].message.content?.startsWith('YES') ?? false;
    },
    onError: 'accept', // fail-open on judge errors (default)
    timeoutMs: 2000, // per-call timeout (default)
  },
});
When the judge is invoked: only for confidence === 'uncertain' hits. High-confidence hits, misses, and the zero-candidates case bypass the judge entirely.
Accept path: result.hit === true, result.confidence === 'high'.
Reject path: result.hit === false, result.nearestMiss populated with deltaToThreshold <= 0 (use this to distinguish judge rejections from regular misses, where deltaToThreshold > 0).
Composing with rerank: when both rerank and judge are set, the judge receives the reranked pick’s response and similarity score.
checkBatch() does not support judge. Call check() individually for prompts that need adjudication.
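The interaction between timeoutMs and onError can be pictured as a race between the judge call and a timer, with failures resolved by the configured policy. This is a sketch of that idea, not the library's implementation: `runJudge` is a hypothetical name, the `'reject'` policy value is an assumption (only `'accept'` appears above), and treating a timeout the same way as a thrown error is also an assumption.

```typescript
type JudgeFnSketch = (input: { prompt: string; response: string }) => Promise<boolean>;

// Run the judge with a per-call timeout. On a timeout or thrown error,
// fall back to the onError policy: 'accept' fails open, 'reject' fails closed.
async function runJudge(
  judgeFn: JudgeFnSketch,
  input: { prompt: string; response: string },
  onError: 'accept' | 'reject' = 'accept',
  timeoutMs = 2000,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error('judge timed out')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the verdict or the timeout rejection.
    return await Promise.race([judgeFn(input), timeout]);
  } catch {
    return onError === 'accept';
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Failing open keeps latency and availability predictable when the judge model is slow or down, at the cost of occasionally serving a borderline response unreviewed.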
Rerank hook
Retrieve top-k candidates and select the best with a custom function:
const result = await cache.check(prompt, {
  rerank: {
    k: 5,
    rerankFn: async (query, candidates) => {
      // Return index of best candidate, or -1 to reject all
      return candidates.findIndex((c) => c.response.length > 50);
    },
  },
});
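A rerank function can also act as a quality gate. The sketch below picks the first candidate whose response is non-empty and does not look like a refusal, and returns -1 when none qualifies. It relies only on the `response` field used in the example above and assumes candidates arrive ordered best match first; `skipRefusals` and the refusal patterns are illustrative.

```typescript
// Only `response` is taken from the example above; no other candidate
// fields are assumed.
interface CandidateSketch {
  response: string;
}

// Illustrative heuristics for refusal-like responses.
const REFUSAL_PATTERNS: RegExp[] = [/^i('m| am) sorry/i, /^i can(no|')t /i, /as an ai/i];

// Return the index of the first usable candidate, or -1 to reject all.
async function skipRefusals(_query: string, candidates: CandidateSketch[]): Promise<number> {
  return candidates.findIndex((c) => {
    const text = c.response.trim();
    return text.length > 0 && !REFUSAL_PATTERNS.some((re) => re.test(text));
  });
}
```

It can be passed directly as `rerankFn: skipRefusals` if the library's candidate type includes a string `response` field, as the example suggests.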
Params-aware filtering
Store sampling parameters as indexed NUMERIC fields for opt-in filtering:
await cache.store(prompt, response, { temperature: 0.7, topP: 0.9, seed: 42 });
const result = await cache.check(prompt, { filter: '@temperature:[0 0]' });
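The filter string uses the `@field:[min max]` numeric range form, so an exact match repeats the same value on both sides. A small sketch of helpers that build such strings follows; `numericRange` and `numericEquals` are hypothetical names, and the field names come from the schema described in this document (temperature, top_p, seed).

```typescript
// Build an inclusive numeric range clause such as '@temperature:[0 0.3]'.
function numericRange(field: string, min: number, max: number): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(field)) {
    throw new Error(`invalid field name: ${field}`);
  }
  if (!Number.isFinite(min) || !Number.isFinite(max) || min > max) {
    throw new Error('min and max must be finite numbers with min <= max');
  }
  return `@${field}:[${min} ${max}]`;
}

// Exact match is a range whose bounds are equal.
function numericEquals(field: string, value: number): string {
  return numericRange(field, value, value);
}
```

For example, `cache.check(prompt, { filter: numericEquals('temperature', 0) })` restricts hits to entries stored with deterministic sampling.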
Invalidation helpers
await cache.invalidateByModel('gpt-4o'); // delete all entries for a model
await cache.invalidateByCategory('geography'); // delete all entries for a category
Observability
Prometheus metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| {prefix}_requests_total | Counter | cache_name, result, category | Total lookups (result: hit/miss/uncertain_hit) |
| {prefix}_similarity_score | Histogram | cache_name, category | Cosine distance on every lookup with a candidate |
| {prefix}_operation_duration_seconds | Histogram | cache_name, operation | End-to-end operation duration |
| {prefix}_embedding_duration_seconds | Histogram | cache_name | Time in embedFn |
| {prefix}_cost_saved_total | Counter | cache_name, category | Cumulative dollars saved from cache hits |
| {prefix}_embedding_cache_total | Counter | cache_name, result | Embedding cache hit/miss counts |
| {prefix}_stale_model_evictions_total | Counter | cache_name | Entries evicted by staleAfterModelChange |
Known limitations
Cluster mode
flush() and embedding cache cleanup use SCAN. In Valkey Cluster mode, SCAN on a single node only iterates that node’s keys. v0.2.0 uses clusterScan() (same pattern as agent-cache) to fan out across all master nodes for these operations.
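The fan-out idea can be sketched as running a full SCAN cursor loop on every master and merging the keys. The library's clusterScan() internals are not shown in this document, so the following is only an illustration of the pattern; `ScanNodeLike` and `scanAllMasters` are hypothetical names, and the `scan` signature mirrors the ioredis-style client call that iovalkey is assumed to share.

```typescript
// Minimal structural type for a per-node client (an assumption).
interface ScanNodeLike {
  scan(cursor: string, ...args: Array<string | number>): Promise<[string, string[]]>;
}

// Iterate SCAN to completion on each master node and merge the results.
async function scanAllMasters(
  masters: ScanNodeLike[],
  pattern: string,
  count = 100,
): Promise<string[]> {
  const keys = new Set<string>();
  for (const node of masters) {
    let cursor = '0';
    do {
      const [next, batch] = await node.scan(cursor, 'MATCH', pattern, 'COUNT', count);
      for (const key of batch) keys.add(key); // SCAN may return duplicates
      cursor = next;
    } while (cursor !== '0'); // a cursor of '0' marks the end of iteration
  }
  return Array.from(keys);
}
```

With an ioredis-style cluster client, the list of masters would typically come from something like `cluster.nodes('master')`; treat that call as an assumption to verify against the iovalkey version in use.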
FT.CREATE and FT.SEARCH work in cluster mode because Valkey routes them to the appropriate node. However, FT.CREATE creates the index only on the node that receives the command, so in a full cluster setup you may need to create the index on each node. This is a fundamental limitation of valkey-search in cluster mode and is documented in the Valkey Search documentation.
Streaming
store() expects a complete response string. Accumulate the full streamed response before calling store(). The createSemanticCacheMiddleware Vercel AI SDK adapter does not implement wrapStream.
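One way to accumulate a streamed response before storing it is to drain the stream into a string and call store() only once the stream has finished. This is a minimal sketch assuming the stream is an `AsyncIterable<string>` of text chunks; `collectStream` and `storeStreamed` are hypothetical helpers, and the `store` callback stands in for `cache.store`.

```typescript
// Drain an async stream of text chunks into one string.
async function collectStream(chunks: AsyncIterable<string>): Promise<string> {
  let full = '';
  for await (const chunk of chunks) {
    full += chunk;
  }
  return full;
}

// Store the response only after the stream completes. If the stream throws
// midway, nothing is stored, so partial responses never enter the cache.
async function storeStreamed(
  store: (prompt: string, response: string) => Promise<unknown>,
  prompt: string,
  chunks: AsyncIterable<string>,
): Promise<string> {
  const response = await collectStream(chunks);
  await store(prompt, response);
  return response;
}
```

A call might look like `storeStreamed((p, r) => cache.store(p, r), prompt, textStream)`. If chunks also need to reach the user as they arrive, forward each chunk inside the loop while still accumulating it.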
Schema migration
Adding binary_refs, temperature, top_p, and seed fields to the index schema in v0.2.0 requires a schema migration for existing v0.1.0 indexes. If the existing index lacks these fields, check() operates in text-only mode (no binary filtering). To migrate, call flush() and initialize() to rebuild with the full schema.
Valkey Search 1.2 compatibility notes
- FT.INFO error format: handles three variants for cross-compatibility
- FT.DROPINDEX DD not supported: key cleanup done via SCAN + DEL
- FT.SEARCH KNN score aliases: not usable in RETURN/SORTBY
- FT.INFO dimension: nested inside "index" sub-array as "dimensions"