Semantic Cache

@betterdb/semantic-cache is a standalone, framework-agnostic semantic cache library for LLM applications backed by Valkey. It uses the valkey-search module’s vector similarity search to match incoming prompts against previously cached responses, returning hits when the cosine distance falls below a configurable threshold.

v0.2.0 adds full adapter parity with agent-cache: OpenAI, Anthropic, LlamaIndex, LangGraph, multi-modal prompt support, cost tracking, threshold effectiveness recommendations, embedding caching, batch lookups, and more.

Prerequisites

  • Valkey 8.0+ with the valkey-search module loaded
  • Or Amazon ElastiCache for Valkey (8.0+)
  • Or Google Cloud Memorystore for Valkey
  • Node.js >= 20

Installation

npm install @betterdb/semantic-cache iovalkey

iovalkey is a peer dependency - you must install it alongside the package.

Quick start

import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';

const client = new Valkey({ host: 'localhost', port: 6399 });

const cache = new SemanticCache({
  client,
  embedFn: createOpenAIEmbed(), // text-embedding-3-small by default
  defaultThreshold: 0.1,
  defaultTtl: 3600,
});

await cache.initialize();

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 20,
  outputTokens: 5,
});

const result = await cache.check('Capital city of France?');
// result.hit === true
// result.response === 'Paris'
// result.costSaved === 0.000105 (based on bundled LiteLLM prices)

Configuration reference

Option Type Default Description
name string 'betterdb_scache' Index name prefix for all Valkey keys
client Valkey required An iovalkey client instance
embedFn (text: string) => Promise<number[]> required Embedding function
defaultThreshold number 0.1 Cosine distance threshold (0-2)
defaultTtl number undefined Default TTL in seconds
categoryThresholds Record<string, number> {} Per-category threshold overrides
uncertaintyBand number 0.05 Width of the uncertainty band below threshold
costTable Record<string, ModelCost> undefined Custom model pricing overrides
useDefaultCostTable boolean true Merge bundled LiteLLM price table
normalizer BinaryNormalizer defaultNormalizer Binary content normalizer for multi-modal prompts
embeddingCache.enabled boolean true Cache computed embeddings in Valkey
embeddingCache.ttl number 86400 Embedding cache TTL in seconds
telemetry.tracerName string '@betterdb/semantic-cache' OTel tracer name
telemetry.metricsPrefix string 'semantic_cache' Prometheus metric name prefix
telemetry.registry Registry prom-client default Custom prom-client Registry

Threshold and confidence

@betterdb/semantic-cache uses cosine distance (0-2 scale, lower = more similar):

Distance Meaning
0.00 Identical vectors
0.05-0.10 Strong paraphrase
0.10-0.20 Loose paraphrase / related topic
1.00 Orthogonal (unrelated)

A lookup is a hit when score <= threshold. The default threshold is 0.1, set per cache via defaultThreshold, per category via categoryThresholds, or per request via the threshold check option.

Confidence levels

Every hit is graded against the uncertaintyBand (default 0.05), the width of the band immediately below the threshold:

confidence When What to do
high score <= threshold - uncertaintyBand (e.g. <= 0.05) Return the cached response directly
uncertain threshold - uncertaintyBand < score <= threshold (e.g. 0.05–0.10) Return the response but consider flagging for review, or hand it to the LLM-as-judge
miss score > threshold No hit - call the LLM

Recommended thresholds by use case:

Use case Threshold Notes
FAQ / exact match only 0.05 Very strict, near-zero false positives
Standard Q&A 0.10 Default - paraphrases land as uncertain
Conversational / RAG 0.15 Paraphrases hit as high confidence
Broad search / recall 0.20 High hit rate, review uncertain hits

Adapters

All adapters are subpath exports with optional peer dependencies.

LangChain

import { BetterDBSemanticCache } from '@betterdb/semantic-cache/langchain';
const llm = new ChatOpenAI({ cache: new BetterDBSemanticCache({ cache }) });

Vercel AI SDK

import { createSemanticCacheMiddleware } from '@betterdb/semantic-cache/ai';
const model = wrapLanguageModel({ model: openai('gpt-4o'), middleware: createSemanticCacheMiddleware({ cache }) });

OpenAI Chat Completions

import { prepareSemanticParams } from '@betterdb/semantic-cache/openai';
const { text, model } = await prepareSemanticParams(params);
const result = await cache.check(text);

OpenAI Responses API

import { prepareSemanticParams } from '@betterdb/semantic-cache/openai-responses';
const { text } = await prepareSemanticParams(params);

Anthropic Messages

import { prepareSemanticParams } from '@betterdb/semantic-cache/anthropic';
const { text } = await prepareSemanticParams(params);

LlamaIndex

import { prepareSemanticParams } from '@betterdb/semantic-cache/llamaindex';
const { text } = await prepareSemanticParams(messages, { model: 'gpt-4o' });

LangGraph (semantic memory store)

import { BetterDBSemanticStore } from '@betterdb/semantic-cache/langgraph';
const store = new BetterDBSemanticStore({ cache });
await store.put(['user', 'alice', 'memories'], 'mem1', { content: 'Alice lives in Paris.' });
const results = await store.search(['user', 'alice', 'memories'], { query: 'Where does Alice live?' });

Use BetterDBSemanticStore for similarity-based memory retrieval. For exact-match checkpoint persistence, use @betterdb/agent-cache/langgraph.

Embedding helpers

Pre-built EmbedFn factories for common providers:

import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';
import { createBedrockEmbed } from '@betterdb/semantic-cache/embed/bedrock';
import { createVoyageEmbed } from '@betterdb/semantic-cache/embed/voyage';
import { createCohereEmbed } from '@betterdb/semantic-cache/embed/cohere';
import { createOllamaEmbed } from '@betterdb/semantic-cache/embed/ollama';
Helper Model default Dimensions
createOpenAIEmbed text-embedding-3-small 1536
createBedrockEmbed amazon.titan-embed-text-v2:0 1024
createVoyageEmbed voyage-3-lite 512
createCohereEmbed embed-english-v3.0 1024
createOllamaEmbed nomic-embed-text 768

Cost tracking

Store token counts alongside responses to enable cost savings reporting:

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 25,
  outputTokens: 5,
});

const result = await cache.check('Capital of France?');
// result.costSaved === 0.000105 on hit

const stats = await cache.stats();
// stats.costSavedMicros === 105 (microdollars)

Cost is computed using the bundled LiteLLM price table (1,971 models). Override or extend with costTable option.

Multi-modal prompts

Use ContentBlock[] to cache prompts with binary content:

import { hashBase64, type ContentBlock } from '@betterdb/semantic-cache';

const prompt: ContentBlock[] = [
  { type: 'text', text: 'Describe this image.' },
  { type: 'binary', kind: 'image', mediaType: 'image/png', ref: hashBase64(imageBase64) },
];

await cache.store(prompt, 'A red square.');
const result = await cache.check(prompt); // hit only if text AND image match

Use storeMultipart() to store structured response blocks:

const blocks: ContentBlock[] = [
  { type: 'text', text: 'The answer is 42.' },
  { type: 'reasoning', text: 'By my calculation...' },
];
await cache.storeMultipart(prompt, blocks);

const result = await cache.check(prompt);
// result.contentBlocks === blocks

Threshold effectiveness recommendations

Analyze the rolling similarity score window for threshold tuning guidance:

const analysis = await cache.thresholdEffectiveness({ minSamples: 100 });
// analysis.recommendation: 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data'
// analysis.recommendedThreshold: 0.085 (present when recommendation is not optimal/insufficient)
// analysis.reasoning: 'Human-readable explanation'

// Per-category analysis
const allCategories = await cache.thresholdEffectivenessAll();

Batch check

Pipeline multiple lookups in a single round-trip:

const results = await cache.checkBatch([
  'What is the capital of France?',
  'Who wrote Hamlet?',
  'What is 2 + 2?',
]);
// results[0].hit === true, etc.

Stale model eviction

Automatically evict cache entries when the model changes:

const result = await cache.check('What is 2+2?', {
  staleAfterModelChange: true,
  currentModel: 'gpt-4o',
});
// If the cached entry was stored with model='gpt-3.5-turbo', it's evicted and treated as miss

LLM-as-judge

When a hit lands in the uncertainty band (threshold - uncertaintyBand < score <= threshold), supply a judgeFn to adjudicate the borderline hit automatically instead of handling confidence: 'uncertain' yourself. This adds a second step on top of the similarity check: check() first classifies the hit, and only uncertain hits are passed to the judge, which accepts (promotes to high) or rejects (treats as a miss).

const result = await cache.check(userPrompt, {
  judge: {
    judgeFn: async ({ prompt, response, similarity, threshold, category }) => {
      // Return true to accept (confidence → 'high')
      // Return false to reject (treated as a miss with nearestMiss)
      const verdict = await openai.chat.completions.create({
        model: 'gpt-5-mini',
        messages: [
          { role: 'system', content: 'Reply YES or NO only.' },
          { role: 'user', content: `Does this cached response correctly answer the prompt?\nPrompt: ${prompt}\nResponse: ${response}` },
        ],
      });
      return verdict.choices[0].message.content?.startsWith('YES') ?? false;
    },
    onError: 'accept',  // fail-open on judge errors (default)
    timeoutMs: 2000,    // per-call timeout (default)
  },
});

When the judge is invoked: only for confidence === 'uncertain' hits. High-confidence hits, misses, and the zero-candidates case bypass the judge entirely.

Accept path: result.hit === true, result.confidence === 'high'.

Reject path: result.hit === false, result.nearestMiss populated with deltaToThreshold <= 0 (use this to distinguish judge rejections from regular misses, where deltaToThreshold > 0).

Composing with rerank: when both rerank and judge are set, the judge receives the reranked pick’s response and similarity score.

checkBatch() does not support judge. Call check() individually for prompts that need adjudication.

Rerank hook

Retrieve top-k candidates and select the best with a custom function:

const result = await cache.check(prompt, {
  rerank: {
    k: 5,
    rerankFn: async (query, candidates) => {
      // Return index of best candidate, or -1 to reject all
      return candidates.findIndex((c) => c.response.length > 50);
    },
  },
});

Params-aware filtering

Store sampling parameters as indexed NUMERIC fields for opt-in filtering:

await cache.store(prompt, response, { temperature: 0.7, topP: 0.9, seed: 42 });
const result = await cache.check(prompt, { filter: '@temperature:[0 0]' });

Invalidation helpers

await cache.invalidateByModel('gpt-4o');       // delete all entries for a model
await cache.invalidateByCategory('geography'); // delete all entries for a category

Observability

Prometheus metrics

Metric Type Labels Description
{prefix}_requests_total Counter cache_name, result, category Total lookups (result: hit/miss/uncertain_hit)
{prefix}_similarity_score Histogram cache_name, category Cosine distance on every lookup with a candidate
{prefix}_operation_duration_seconds Histogram cache_name, operation End-to-end operation duration
{prefix}_embedding_duration_seconds Histogram cache_name Time in embedFn
{prefix}_cost_saved_total Counter cache_name, category Cumulative dollars saved from cache hits
{prefix}_embedding_cache_total Counter cache_name, result Embedding cache hit/miss counts
{prefix}_stale_model_evictions_total Counter cache_name Entries evicted by staleAfterModelChange

Known limitations

Cluster mode

flush() and embedding cache cleanup use SCAN. In Valkey Cluster mode, SCAN on a single node only iterates that node’s keys. v0.2.0 uses clusterScan() (same pattern as agent-cache) to fan out across all master nodes for these operations.

The FT.CREATE index and FT.SEARCH queries work correctly in cluster mode because Valkey routes them to the appropriate node. However, FT.CREATE creates the index only on the node that receives the command - in a full cluster setup, users may need to create the index on each node. This is a fundamental limitation of valkey-search in cluster mode and is documented in the Valkey Search documentation.

Streaming

store() expects a complete response string. Accumulate the full streamed response before calling store(). The createSemanticCacheMiddleware Vercel AI SDK adapter does not implement wrapStream.

Schema migration

Adding binary_refs, temperature, top_p, and seed fields to the index schema in v0.2.0 requires a schema migration for existing v0.1.0 indexes. If the existing index lacks these fields, check() operates in text-only mode (no binary filtering). To migrate, call flush() and initialize() to rebuild with the full schema.

Valkey Search 1.2 compatibility notes

  1. FT.INFO error format: handles three variants for cross-compatibility
  2. FT.DROPINDEX DD not supported: key cleanup done via SCAN + DEL
  3. FT.SEARCH KNN score aliases: not usable in RETURN/SORTBY
  4. FT.INFO dimension: nested inside "index" sub-array as "dimensions"