Semantic Cache

@betterdb/semantic-cache is a standalone, framework-agnostic semantic cache library for LLM applications, backed by Valkey. It uses the valkey-search module’s vector similarity search to compare each incoming prompt with previously cached prompts. When the cosine distance to the closest match falls below a configurable threshold, it returns the stored response as a hit.
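For intuition about the threshold: cosine distance is 1 minus cosine similarity, so it ranges from 0 (the vectors point the same way) to 2 (they point in opposite directions). The plain TypeScript sketch below computes it for two embeddings. valkey-search does this server-side, and the function is only an illustration, not part of the package.

```typescript
// Cosine distance between two embeddings: 1 - cos(angle between them).
// 0 = same direction, 1 = orthogonal, 2 = opposite direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine distance is undefined for a zero vector');
  }
  return 1 - dot / Math.sqrt(normA * normB);
}
```

Only the direction of the vectors matters, not their length. For example, [1, 2, 3] and [2, 4, 6] are at distance 0.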

v0.2.0 brings full adapter parity with @betterdb/agent-cache (OpenAI, Anthropic, LlamaIndex, LangGraph). It also adds multi-modal prompt support, cost tracking, threshold effectiveness recommendations, embedding caching, batch lookups, stale model eviction, rerank hooks, params-aware filtering, and invalidation helpers.

Prerequisites

Valkey 8.0+ with the valkey-search module loaded
Or Amazon ElastiCache for Valkey (8.0+)
Or Google Cloud Memorystore for Valkey
Node.js >= 20

Installation

npm install @betterdb/semantic-cache iovalkey

iovalkey is a peer dependency, so you must install it alongside the package.

Quick start

import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';

const client = new Valkey({ host: 'localhost', port: 6379 }); // 6379 is Valkey's default port

const cache = new SemanticCache({
  client,
  embedFn: createOpenAIEmbed(), // text-embedding-3-small by default
  defaultThreshold: 0.1,
  defaultTtl: 3600,
});

await cache.initialize();

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 20,
  outputTokens: 5,
});

const result = await cache.check('Capital city of France?');
// result.hit === true
// result.response === 'Paris'
// result.costSaved: dollars saved by this hit, priced from the bundled LiteLLM table

Configuration reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | `'betterdb_scache'` | Index name prefix for all Valkey keys |
| `client` | `Valkey` | required | An `iovalkey` client instance |
| `embedFn` | `(text: string) => Promise<number[]>` | required | Embedding function |
| `defaultThreshold` | `number` | `0.1` | Cosine distance threshold (0-2) |
| `defaultTtl` | `number` | `undefined` | Default TTL in seconds |
| `categoryThresholds` | `Record<string, number>` | `{}` | Per-category threshold overrides |
| `uncertaintyBand` | `number` | `0.05` | Width of the uncertainty band below the threshold |
| `costTable` | `Record<string, ModelCost>` | `undefined` | Custom model pricing overrides |
| `useDefaultCostTable` | `boolean` | `true` | Merge the bundled LiteLLM price table |
| `normalizer` | `BinaryNormalizer` | `defaultNormalizer` | Binary content normalizer for multi-modal prompts |
| `embeddingCache.enabled` | `boolean` | `true` | Cache computed embeddings in Valkey |
| `embeddingCache.ttl` | `number` | `86400` | Embedding cache TTL in seconds |
| `telemetry.tracerName` | `string` | `'@betterdb/semantic-cache'` | OTel tracer name |
| `telemetry.metricsPrefix` | `string` | `'semantic_cache'` | Prometheus metric name prefix |
| `telemetry.registry` | `Registry` | prom-client default | Custom prom-client Registry |
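The sketch below shows roughly how defaultThreshold, categoryThresholds, and uncertaintyBand interact on a lookup. It is a hypothetical illustration, not the library's code. The helper names are invented, and the inclusive boundaries (<= rather than <) are an assumption. The three labels mirror the result label of the requests_total metric.

```typescript
// Labels mirror the `result` label of the requests_total metric.
type LookupResult = 'hit' | 'uncertain_hit' | 'miss';

// Hypothetical: pick the threshold for a lookup, preferring a per-category override.
function thresholdFor(
  category: string | undefined,
  defaultThreshold: number,
  categoryThresholds: Record<string, number>,
): number {
  // hasOwnProperty avoids matching inherited keys such as "toString".
  if (category !== undefined && Object.prototype.hasOwnProperty.call(categoryThresholds, category)) {
    return categoryThresholds[category];
  }
  return defaultThreshold;
}

// Hypothetical: classify the nearest candidate's cosine distance.
// Inclusive boundaries are an assumption, not documented behaviour.
function classifyLookup(distance: number, threshold: number, uncertaintyBand: number): LookupResult {
  if (distance > threshold) return 'miss';
  // Within the band just below the threshold: a match, but a borderline one.
  if (distance >= threshold - uncertaintyBand) return 'uncertain_hit';
  return 'hit';
}
```

With the defaults (threshold 0.1, band 0.05), a distance of 0.02 is a hit, 0.08 is an uncertain hit, and 0.15 is a miss.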

Adapters

All adapters are subpath exports with optional peer dependencies. Install a framework's package only if you use its adapter.

LangChain

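import { ChatOpenAI } from '@langchain/openai';
import { BetterDBSemanticCache } from '@betterdb/semantic-cache/langchain';
// `cache` is the SemanticCache instance from the quick start
const llm = new ChatOpenAI({ cache: new BetterDBSemanticCache({ cache }) });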

Vercel AI SDK

import { wrapLanguageModel } from 'ai';
import { openai } from '@ai-sdk/openai';
import { createSemanticCacheMiddleware } from '@betterdb/semantic-cache/ai';
const model = wrapLanguageModel({ model: openai('gpt-4o'), middleware: createSemanticCacheMiddleware({ cache }) });

OpenAI Chat Completions

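import { prepareSemanticParams } from '@betterdb/semantic-cache/openai';
// params: the request you would pass to openai.chat.completions.create()
const { text, model } = await prepareSemanticParams(params);
const result = await cache.check(text);
// On a miss, call the API, then cache the reply, e.g. cache.store(text, reply, { model })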

OpenAI Responses API

import { prepareSemanticParams } from '@betterdb/semantic-cache/openai-responses';
// params: the request you would pass to openai.responses.create()
const { text } = await prepareSemanticParams(params);

Anthropic Messages

import { prepareSemanticParams } from '@betterdb/semantic-cache/anthropic';
// params: the request you would pass to anthropic.messages.create()
const { text } = await prepareSemanticParams(params);

LlamaIndex

import { prepareSemanticParams } from '@betterdb/semantic-cache/llamaindex';
// messages: the LlamaIndex chat messages you would send to the LLM
const { text } = await prepareSemanticParams(messages, { model: 'gpt-4o' });

LangGraph (semantic memory store)

import { BetterDBSemanticStore } from '@betterdb/semantic-cache/langgraph';
const store = new BetterDBSemanticStore({ cache });
await store.put(['user', 'alice', 'memories'], 'mem1', { content: 'Alice lives in Paris.' });
const results = await store.search(['user', 'alice', 'memories'], { query: 'Where does Alice live?' });

Use BetterDBSemanticStore for similarity-based memory retrieval. For exact-match checkpoint persistence, use @betterdb/agent-cache/langgraph.

Embedding helpers

Pre-built EmbedFn factories for common providers:

import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';
import { createBedrockEmbed } from '@betterdb/semantic-cache/embed/bedrock';
import { createVoyageEmbed } from '@betterdb/semantic-cache/embed/voyage';
import { createCohereEmbed } from '@betterdb/semantic-cache/embed/cohere';
import { createOllamaEmbed } from '@betterdb/semantic-cache/embed/ollama';

| Helper | Default model | Dimensions |
| --- | --- | --- |
| `createOpenAIEmbed` | `text-embedding-3-small` | 1536 |
| `createBedrockEmbed` | `amazon.titan-embed-text-v2:0` | 1024 |
| `createVoyageEmbed` | `voyage-3-lite` | 512 |
| `createCohereEmbed` | `embed-english-v3.0` | 1024 |
| `createOllamaEmbed` | `nomic-embed-text` | 768 |
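Any function with the embedFn signature, (text: string) => Promise<number[]>, can replace these helpers. The hypothetical createHashingEmbed below is one example. It is not exported by the package. It is a deterministic, network-free embedder for unit tests. It only reflects word overlap, so it is not suitable for production. Keep its dimension fixed for a given cache, because query vectors and stored vectors must have the same length to be compared.

```typescript
// Matches the `embedFn` option type from the configuration table.
type EmbedFn = (text: string) => Promise<number[]>;

// Hypothetical test helper: a deterministic "hashing trick" embedder.
// Each lowercase word is hashed (32-bit FNV-1a) into one of `dims` buckets,
// then the vector is L2-normalized. No network access is needed.
function createHashingEmbed(dims = 256): EmbedFn {
  return async (text: string): Promise<number[]> => {
    const vec = new Array<number>(dims).fill(0);
    const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    for (const word of words) {
      let h = 0x811c9dc5; // FNV-1a offset basis
      for (let i = 0; i < word.length; i++) {
        h ^= word.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
      }
      vec[h % dims] += 1;
    }
    const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
    // Empty input yields the zero vector rather than dividing by zero.
    return norm === 0 ? vec : vec.map((x) => x / norm);
  };
}
```

In a test you could pass embedFn: createHashingEmbed() to the SemanticCache constructor instead of a provider helper.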

Cost tracking

Store token counts alongside responses to enable cost savings reporting:

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o',
  inputTokens: 25,
  outputTokens: 5,
});

const result = await cache.check('Capital of France?');
// result.costSaved > 0 on a hit: dollars saved, priced from the bundled LiteLLM table

const stats = await cache.stats();
// stats.costSavedMicros: cumulative savings in microdollars (1 USD = 1,000,000 micros)

Cost is computed from the bundled LiteLLM price table (1,971 models). Override or extend it with the costTable option.
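The arithmetic behind costSaved is the price of the LLM call that the hit avoided. The sketch below assumes simple per-token pricing. TokenPrice, costSavedUsd, and toMicros are hypothetical names, not package exports. The prices in the worked example are illustrative placeholders, not values read from the bundled table.

```typescript
// Hypothetical per-token price record; the package's ModelCost type may be shaped differently.
interface TokenPrice {
  inputCostPerToken: number; // USD per input token
  outputCostPerToken: number; // USD per output token
}

// Dollars a cache hit saves: what the skipped LLM call would have cost.
function costSavedUsd(price: TokenPrice, inputTokens: number, outputTokens: number): number {
  return inputTokens * price.inputCostPerToken + outputTokens * price.outputCostPerToken;
}

// Convert to whole microdollars (1 USD = 1,000,000 micros) so running totals stay integers.
function toMicros(usd: number): number {
  return Math.round(usd * 1_000_000);
}
```

Take placeholder prices of $2.50 per million input tokens and $10 per million output tokens. Then 20 input and 5 output tokens cost 20 × 0.0000025 + 5 × 0.00001 = 0.0001 USD, or 100 micros. Summing whole microdollars, the unit that stats.costSavedMicros reports, keeps the total an integer and avoids accumulating floating-point error.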

Multi-modal prompts

Use ContentBlock[] to cache prompts that include binary content such as images:

import { hashBase64, type ContentBlock } from '@betterdb/semantic-cache';

const prompt: ContentBlock[] = [
  { type: 'text', text: 'Describe this image.' },
  // imageBase64: your image as a base64 string; the block carries only its hash
  { type: 'binary', kind: 'image', mediaType: 'image/png', ref: hashBase64(imageBase64) },
];

await cache.store(prompt, 'A red square.');
const result = await cache.check(prompt); // hit only if text AND image match
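The ref field is what makes binary matching work. The same bytes always produce the same hash, and a different image produces a different one. The sketch below shows the idea using SHA-256 from node:crypto. contentRef is a hypothetical stand-in, and the algorithm behind hashBase64 may differ.

```typescript
import { Buffer } from 'node:buffer';
import { createHash } from 'node:crypto';

// Hypothetical stand-in for hashBase64: decode the base64 payload and hash the
// raw bytes with SHA-256, returning a hex digest usable as a content reference.
function contentRef(base64: string): string {
  const bytes = Buffer.from(base64, 'base64');
  return createHash('sha256').update(bytes).digest('hex');
}
```

Because the reference is derived from the bytes, the prompt needs to carry only a short, fixed-length string per image, not the image itself.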

Use storeMultipart() to store structured response blocks:

const blocks: ContentBlock[] = [
  { type: 'text', text: 'The answer is 42.' },
  { type: 'reasoning', text: 'By my calculation...' },
];
await cache.storeMultipart(prompt, blocks);

const result = await cache.check(prompt);
// result.contentBlocks deep-equals blocks (a deserialized copy, not the same array)

Threshold effectiveness recommendations

Analyze the rolling similarity score window for threshold tuning guidance:

const analysis = await cache.thresholdEffectiveness({ minSamples: 100 });
// analysis.recommendation: 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data'
// analysis.recommendedThreshold: e.g. 0.085 (only present for tighten_threshold / loosen_threshold)
// analysis.reasoning: 'Human-readable explanation'

// Per-category analysis
const allCategories = await cache.thresholdEffectivenessAll();
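The rules behind thresholdEffectiveness() are not described here. The heuristic below is a hypothetical illustration of what a rolling window of distances can reveal, and it is not the package's algorithm. Too many hits that barely pass suggest tightening, and too many misses that barely fail suggest loosening. The 20% cut-off and the use of the uncertainty band on both sides of the threshold are assumptions.

```typescript
type Recommendation = 'tighten_threshold' | 'loosen_threshold' | 'optimal' | 'insufficient_data';

// Hypothetical heuristic, NOT the package's algorithm.
// distances: the nearest candidate's cosine distance for each recent lookup.
function recommendThreshold(
  distances: number[],
  threshold: number,
  band: number,
  minSamples = 100,
  maxShare = 0.2, // assumed cut-off: more than 20% borderline lookups suggests a change
): Recommendation {
  if (distances.length < minSamples) return 'insufficient_data';
  const n = distances.length;
  // Hits that barely passed: the likeliest false positives.
  const borderlineHits = distances.filter((d) => d <= threshold && d >= threshold - band).length;
  // Misses that barely failed: the likeliest false negatives.
  const nearMisses = distances.filter((d) => d > threshold && d <= threshold + band).length;
  if (borderlineHits / n > maxShare && borderlineHits >= nearMisses) return 'tighten_threshold';
  if (nearMisses / n > maxShare) return 'loosen_threshold';
  return 'optimal';
}
```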

Batch check

Pipeline multiple lookups in a single round-trip:

const results = await cache.checkBatch([
  'What is the capital of France?',
  'Who wrote Hamlet?',
  'What is 2 + 2?',
]);
// results[0].hit === true, etc.
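checkBatch() fits a read-through pattern: answer hits from the batch and call the LLM only for the misses. The minimal sketch below assumes each result exposes hit and response, as in the quick start. BatchCache and answerAll are hypothetical names. The structural interface lets the helper run against an in-memory fake in tests.

```typescript
// Minimal structural view of the cache; the result shape is an assumption
// based on the quick-start example (result.hit, result.response).
interface BatchCache {
  checkBatch(prompts: string[]): Promise<Array<{ hit: boolean; response?: string }>>;
  store(prompt: string, response: string): Promise<unknown>;
}

// Hypothetical read-through helper: serve hits from one batched lookup,
// call the LLM concurrently for the misses, and cache each new answer.
async function answerAll(
  cache: BatchCache,
  prompts: string[],
  callLLM: (prompt: string) => Promise<string>,
): Promise<string[]> {
  const results = await cache.checkBatch(prompts);
  return Promise.all(
    prompts.map(async (prompt, i) => {
      const r = results[i];
      if (r.hit && r.response !== undefined) return r.response;
      const answer = await callLLM(prompt);
      await cache.store(prompt, answer);
      return answer;
    }),
  );
}
```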

Stale model eviction

Evict a matched entry when it was stored under a different model than the one you are using now:

const result = await cache.check('What is 2+2?', {
  staleAfterModelChange: true,
  currentModel: 'gpt-4o',
});
// If the matched entry was stored with model: 'gpt-3.5-turbo', it is evicted and the lookup is treated as a miss

Rerank hook

Retrieve top-k candidates and select the best with a custom function:

const result = await cache.check(prompt, {
  rerank: {
    k: 5,
    rerankFn: async (query, candidates) => {
      // Return index of best candidate, or -1 to reject all
      return candidates.findIndex((c) => c.response.length > 50);
    },
  },
});

Params-aware filtering

Store sampling parameters as indexed NUMERIC fields for opt-in filtering:

await cache.store(prompt, response, { temperature: 0.7, topP: 0.9, seed: 42 });
// Only consider entries stored with temperature 0; the 0.7 entry above is excluded
const result = await cache.check(prompt, { filter: '@temperature:[0 0]' });
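When several sampling parameters matter, you can build the filter string from whatever the request uses. The helper below is hypothetical. It assumes the index field names listed under Schema migration (temperature, top_p, seed). It also assumes that valkey-search intersects space-separated clauses (AND), in the same query syntax as the @temperature:[0 0] filter. A range [x x] matches only the value x.

```typescript
// Hypothetical helper: exact-match filter over the NUMERIC sampling fields.
interface SamplingParams {
  temperature?: number;
  topP?: number;
  seed?: number;
}

function buildParamsFilter(params: SamplingParams): string | undefined {
  // Map option names to the assumed index field names.
  const fields: Array<[string, number | undefined]> = [
    ['temperature', params.temperature],
    ['top_p', params.topP],
    ['seed', params.seed],
  ];
  const clauses = fields
    .filter((entry): entry is [string, number] => entry[1] !== undefined)
    .map(([name, value]) => `@${name}:[${value} ${value}]`);
  // No parameters set: return undefined so no filter is applied.
  return clauses.length > 0 ? clauses.join(' ') : undefined;
}
```

For example, cache.check(prompt, { filter: buildParamsFilter({ temperature: 0.7, topP: 0.9, seed: 42 }) }) would only consider entries stored with exactly those three values, such as the one stored above.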

Invalidation helpers

await cache.invalidateByModel('gpt-4o');       // delete all entries for a model
await cache.invalidateByCategory('geography'); // delete all entries for a category

Observability

Prometheus metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `{prefix}_requests_total` | Counter | `cache_name`, `result`, `category` | Total lookups (result: hit/miss/uncertain_hit) |
| `{prefix}_similarity_score` | Histogram | `cache_name`, `category` | Cosine distance on every lookup with a candidate |
| `{prefix}_operation_duration_seconds` | Histogram | `cache_name`, `operation` | End-to-end operation duration |
| `{prefix}_embedding_duration_seconds` | Histogram | `cache_name` | Time in embedFn |
| `{prefix}_cost_saved_total` | Counter | `cache_name`, `category` | Cumulative dollars saved from cache hits |
| `{prefix}_embedding_cache_total` | Counter | `cache_name`, `result` | Embedding cache hit/miss counts |
| `{prefix}_stale_model_evictions_total` | Counter | `cache_name` | Entries evicted by staleAfterModelChange |

Known limitations

Cluster mode

flush() and embedding cache cleanup use SCAN. In Valkey Cluster mode, SCAN on a single node only iterates that node’s keys. v0.2.0 uses clusterScan() (same pattern as agent-cache) to fan out across all master nodes for these operations.

FT.SEARCH queries are routed by Valkey to the appropriate node. FT.CREATE, however, creates the index only on the node that receives the command, so in a full cluster setup you may need to create the index on each node. This is a limitation of valkey-search in cluster mode and is documented in the Valkey Search documentation.

Streaming

store() expects a complete response string, so accumulate the full streamed response before calling it. The Vercel AI SDK adapter (createSemanticCacheMiddleware) does not implement wrapStream, so streaming calls bypass the cache.
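One way to follow this advice without delaying the user is to forward chunks as they arrive and call store() only once the stream has ended. The minimal sketch below uses teeAndStore, a hypothetical helper. onComplete would typically be (text) => cache.store(prompt, text).

```typescript
// Hypothetical helper: yield chunks to the caller as they arrive while
// accumulating them, then hand the complete text to onComplete.
async function* teeAndStore(
  chunks: AsyncIterable<string>,
  onComplete: (fullText: string) => Promise<unknown>,
): AsyncGenerator<string> {
  let full = '';
  for await (const chunk of chunks) {
    full += chunk;
    yield chunk;
  }
  // Only reached when the stream finishes normally: if the consumer stops
  // early or the source throws, the partial text is never stored.
  await onComplete(full);
}
```

Skipping storage on early exit is deliberate. A truncated answer cached under the full prompt would be served to every later similar request.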

Schema migration

v0.2.0 adds the binary_refs, temperature, top_p, and seed fields to the index schema, so indexes created with v0.1.0 need a migration. If an existing index lacks these fields, check() runs in text-only mode (no binary filtering). To migrate, call flush() and then initialize() to rebuild the index with the full schema. Note that flush() deletes the existing cache entries.

Valkey Search 1.2 compatibility notes

FT.INFO error format: three error-message variants are handled for cross-compatibility
FT.DROPINDEX DD is not supported: keys are cleaned up with SCAN + DEL instead
FT.SEARCH KNN score aliases cannot be used in RETURN or SORTBY
FT.INFO dimension: reported as "dimensions" inside the nested "index" sub-array