Prometheus Metrics Reference

Complete reference for all metrics exposed by BetterDB Monitor at the /prometheus/metrics endpoint.

Overview
Metrics Categories
Scrape Configuration
Useful PromQL Queries
Alertmanager Rules

Overview

BetterDB Monitor exposes Prometheus-compatible metrics at:

GET /prometheus/metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8

All custom metrics are prefixed with betterdb_. Standard Node.js process metrics from prom-client are also included with the same prefix.

Scrape Interval: Recommended 15s Metrics Update: Metrics are computed on-demand during each scrape

Metrics Categories

ACL Audit Metrics

Track ACL denied events captured from the monitored Valkey/Redis instance.

Metric	Type	Labels	Description	Example
`betterdb_acl_denied`	gauge	-	Total ACL denied events captured	`42`
`betterdb_acl_denied_by_reason`	gauge	`reason`	ACL denied events by reason (auth, command, key, channel)	`15`
`betterdb_acl_denied_by_user`	gauge	`username`	ACL denied events by username	`8`

Cardinality Warning: betterdb_acl_denied_by_user cardinality scales with number of unique usernames experiencing failures.

Client Analytics Metrics

Monitor client connection patterns and trends.

Metric	Type	Labels	Description	Example
`betterdb_client_connections_current`	gauge	-	Current number of client connections	`127`
`betterdb_client_connections_peak`	gauge	-	Peak connections in retention period	`256`
`betterdb_client_connections_by_name`	gauge	`client_name`	Current connections by client name	`12`
`betterdb_client_connections_by_user`	gauge	`user`	Current connections by ACL user	`25`

Cardinality Warning: Label-based metrics scale with unique client names and usernames.

Slowlog Metrics

Analyze slow query patterns aggregated from SLOWLOG data.

Metric	Type	Labels	Description	Example
`betterdb_slowlog_length`	gauge	-	Current slowlog length	`128`
`betterdb_slowlog_last_id`	gauge	-	ID of last slowlog entry	`12345`
`betterdb_slowlog_pattern_count`	gauge	`pattern`	Number of slow queries per pattern	`24`
`betterdb_slowlog_pattern_avg_duration_us`	gauge	`pattern`	Average duration in microseconds per pattern	`1250000`
`betterdb_slowlog_pattern_percentage`	gauge	`pattern`	Percentage of slow queries per pattern	`18.75`

Pattern Examples: GET *, HGETALL *, SCAN *

COMMANDLOG Metrics (Valkey 8.1+)

Valkey-specific metrics for tracking large request/reply commands.

Metric	Type	Labels	Description	Example
`betterdb_commandlog_large_request`	gauge	-	Total large request entries	`15`
`betterdb_commandlog_large_reply`	gauge	-	Total large reply entries	`8`
`betterdb_commandlog_large_request_by_pattern`	gauge	`pattern`	Large request count by command pattern	`5`
`betterdb_commandlog_large_reply_by_pattern`	gauge	`pattern`	Large reply count by command pattern	`3`

Availability: Only populated when connected to Valkey 8.1+. Returns no data for Redis or older Valkey versions.

Vector Index Metrics

Per-index health metrics for vector search indexes, populated by VectorSearchService every 30 s. Gauges are emitted once per (connection, index) pair, and stale labels are automatically removed when an index is dropped between polls.

Metric	Type	Labels	Description	Example
`betterdb_vector_index_docs`	gauge	`index`	Current document count for a vector index	`30000`
`betterdb_vector_index_memory_bytes`	gauge	`index`	Current memory usage for a vector index, in bytes	`62914560`
`betterdb_vector_index_indexing_failures`	gauge	`index`	Cumulative `hash_indexing_failures` for a vector index	`0`
`betterdb_vector_index_percent_indexed`	gauge	`index`	Percent of documents indexed (0–100)	`100`

Availability: Only populated when connected to an instance with the Search module loaded (RediSearch or valkey-search). Returns no data otherwise. See Vector / AI for the feature overview and the REST endpoints that back the monitor UI.

Cardinality Warning: Label cardinality scales with the number of indexes per connection. Typical deployments have single-digit index counts; if you run hundreds of indexes per instance, monitor scrape size accordingly.

Commandstats Metrics

Per-command execution counts and latency sourced from INFO commandstats, populated by CommandstatsPollerService every 60 s. Gauges are emitted once per (connection, command) pair, and stale labels are automatically removed when a command disappears between polls (e.g., after CONFIG RESETSTAT).

Metric	Type	Labels	Description	Example
`betterdb_commandstats_calls_total`	gauge	`command`	Cumulative number of times a command has been executed (`calls` from INFO commandstats)	`1523`
`betterdb_commandstats_latency_us`	gauge	`command`	Rolling average command latency in microseconds (`usec_per_call` from INFO commandstats)	`29700`

calls_total is published as a gauge (not a counter) because the source value is the absolute cumulative count reported by the server, not an increment computed in-process. This makes rate() queries behave correctly across both incremental polls and CONFIG RESETSTAT-driven counter resets, without needing Prometheus-side counter-reset detection.

Cardinality Warning: Label cardinality scales with the number of distinct commands executed per connection. A typical Valkey workload exposes a few dozen; modules like RediSearch add another handful. If your workload uses a very large module surface, monitor scrape size accordingly.

Inference Latency Metrics

Percentile latency for inference-shaped buckets, sourced from per-entry duration tables (command_log_entries on Valkey 8.1+, slowlog_entries elsewhere). Buckets are FT.SEARCH:<index-name> per configured index, plus aggregate read (GET/MGET) and write (SET/HSET family). Populated by the InferenceLatencyService evaluation loop; stale labels are removed when a bucket disappears.

Metric	Type	Labels	Description	Example
`betterdb_inference_bucket_p50_us`	gauge	`bucket`	p50 latency in microseconds for an inference bucket	`4200`
`betterdb_inference_bucket_p95_us`	gauge	`bucket`	p95 latency in microseconds for an inference bucket	`18500`
`betterdb_inference_bucket_p99_us`	gauge	`bucket`	p99 latency in microseconds for an inference bucket	`31000`
`betterdb_inference_unhealthy`	gauge	`bucket`	Whether a bucket is unhealthy (p50 > 10 ms for `FT.SEARCH:*`): 1 unhealthy, 0 healthy	`0`
`betterdb_inference_sla_breach`	gauge	`index`	Whether the configured per-index p99 SLA is currently breached: 1 breached, 0 ok	`0`

Threshold-gating bias: the source tables only store entries slower than the configured threshold directive (commandlog-execution-slower-than or slowlog-log-slower-than), so percentiles skew toward the tail. The /inference-latency/profile HTTP response exposes the active directive + value so consumers can qualify the number.

Cardinality Warning: FT.SEARCH buckets scale with the number of vector indexes per connection. inference_sla_breach only emits for indexes with an active SLA configured (Pro tier). Aggregate read / write buckets are constant cardinality.

Server Info Metrics

Basic server identification and uptime.

Metric	Type	Labels	Description	Example
`betterdb_uptime_in_seconds`	gauge	-	Server uptime in seconds	`864000`
`betterdb_instance_info`	gauge	`version`, `role`, `os`	Instance information (always 1)	`1`

Label Example: version="8.0.1", role="master", os="Linux 5.15.0"

Memory Metrics

Detailed memory usage and fragmentation tracking.

Metric	Type	Labels	Description	Example
`betterdb_memory_used_bytes`	gauge	-	Total allocated memory in bytes	`1073741824`
`betterdb_memory_used_rss_bytes`	gauge	-	RSS memory usage in bytes	`1200000000`
`betterdb_memory_used_peak_bytes`	gauge	-	Peak memory usage in bytes	`1500000000`
`betterdb_memory_max_bytes`	gauge	-	Maximum memory limit in bytes (0 if unlimited)	`2147483648`
`betterdb_memory_fragmentation_ratio`	gauge	-	Memory fragmentation ratio	`1.15`
`betterdb_memory_fragmentation_bytes`	gauge	-	Memory fragmentation in bytes	`126000000`

Stats Metrics

Operational statistics and throughput.

Metric	Type	Labels	Description	Example
`betterdb_connections_received_total`	gauge	-	Total connections received	`45678`
`betterdb_commands_processed_total`	gauge	-	Total commands processed	`12456789`
`betterdb_instantaneous_ops_per_sec`	gauge	-	Current operations per second	`2500`
`betterdb_instantaneous_input_kbps`	gauge	-	Current input kilobytes per second	`125.5`
`betterdb_instantaneous_output_kbps`	gauge	-	Current output kilobytes per second	`856.3`
`betterdb_keyspace_hits_total`	gauge	-	Total keyspace hits	`9876543`
`betterdb_keyspace_misses_total`	gauge	-	Total keyspace misses	`234567`
`betterdb_evicted_keys_total`	gauge	-	Total evicted keys	`1234`
`betterdb_expired_keys_total`	gauge	-	Total expired keys	`56789`
`betterdb_pubsub_channels`	gauge	-	Number of pub/sub channels	`12`
`betterdb_pubsub_patterns`	gauge	-	Number of pub/sub patterns	`3`

CPU Metrics

Server CPU consumption from the Valkey/Redis INFO CPU section.

Metric	Type	Labels	Description	Example
`betterdb_cpu_sys_seconds_total`	gauge	`connection`	Cumulative system CPU time consumed by the server in seconds	`123.45`
`betterdb_cpu_user_seconds_total`	gauge	`connection`	Cumulative user CPU time consumed by the server in seconds	`456.78`

System vs User CPU

cpu_sys_seconds_total tracks time the CPU spent in kernel space on behalf of the Valkey process - network I/O syscalls, memory allocation, and other OS-level operations. cpu_user_seconds_total tracks time spent executing Valkey’s own code in userspace - command processing, data structure operations, Lua scripts, and so on.

For a lightly loaded instance, system CPU is typically higher than user because most work is network I/O. A spike in user CPU points to CPU-intensive commands (large SORT operations, complex Lua scripts, big key scans). A spike in system CPU points to network or memory pressure.

Note: These are cumulative counters exposed as gauges. Use rate() in PromQL to compute per-second CPU usage.

Replication Metrics

Replication status and offset tracking.

Metric	Type	Labels	Description	Example
`betterdb_connected_slaves`	gauge	-	Number of connected replicas	`2`
`betterdb_replication_offset`	gauge	-	Replication offset	`123456789`
`betterdb_master_link_up`	gauge	-	1 if link to master is up (replica only)	`1`
`betterdb_master_last_io_seconds_ago`	gauge	-	Seconds since last I/O with master (replica only)	`2`

Keyspace Metrics

Per-database key statistics.

Metric	Type	Labels	Description	Example
`betterdb_db_keys`	gauge	`db`	Total keys in database	`125000`
`betterdb_db_keys_expiring`	gauge	`db`	Keys with expiration in database	`45000`
`betterdb_db_avg_ttl_seconds`	gauge	`db`	Average TTL in seconds	`3600`

Label Example: db="db0", db="db1"

Cluster Metrics

Cluster mode health and slot distribution.

Metric	Type	Labels	Description	Example
`betterdb_cluster_enabled`	gauge	-	1 if cluster mode is enabled	`1`
`betterdb_cluster_known_nodes`	gauge	-	Number of known cluster nodes	`6`
`betterdb_cluster_size`	gauge	-	Number of master nodes in cluster	`3`
`betterdb_cluster_slots_assigned`	gauge	-	Number of assigned slots	`16384`
`betterdb_cluster_slots_ok`	gauge	-	Number of slots in OK state	`16384`
`betterdb_cluster_slots_fail`	gauge	-	Number of slots in FAIL state	`0`
`betterdb_cluster_slots_pfail`	gauge	-	Number of slots in PFAIL state	`0`

Cluster Slot Metrics (Valkey 8.0+)

Metric	Type	Labels	Description	Example
`betterdb_cluster_slot_keys`	gauge	`slot`	Keys in cluster slot	`512`
`betterdb_cluster_slot_expires`	gauge	`slot`	Expiring keys in cluster slot	`128`
`betterdb_cluster_slot_reads_total`	gauge	`slot`	Total reads for cluster slot	`45678`
`betterdb_cluster_slot_writes_total`	gauge	`slot`	Total writes for cluster slot	`12345`

Availability: Only populated when connected to Valkey 8.0+ cluster. Limited to top 100 slots by key count.

Anomaly Detection Metrics

Real-time anomaly detection system metrics.

Event Metrics

Metric	Type	Labels	Description	Example
`betterdb_anomaly_events_total`	counter	`severity`, `metric_type`, `anomaly_type`	Total anomaly events detected	`42`
`betterdb_anomaly_events_current`	gauge	`severity`	Unresolved anomalies by severity	`3`
`betterdb_anomaly_by_severity`	gauge	`severity`	Anomalies in last hour by severity	`12`
`betterdb_anomaly_by_metric`	gauge	`metric_type`	Anomalies in last hour by metric	`8`

Label Values:

severity: info, warning, critical
metric_type: connections, ops_per_sec, memory_used, input_kbps, output_kbps, slowlog_last_id, acl_denied, evicted_keys, blocked_clients, keyspace_misses, fragmentation_ratio, cpu_utilization, replication_role
anomaly_type: spike, drop

Correlation Metrics

Metric	Type	Labels	Description	Example
`betterdb_correlated_groups_total`	counter	`pattern`, `severity`	Total correlated anomaly groups	`15`
`betterdb_correlated_groups_by_severity`	gauge	`severity`	Groups in last hour by severity	`8`
`betterdb_correlated_groups_by_pattern`	gauge	`pattern`	Groups in last hour by pattern	`5`

Pattern Values: traffic_burst, batch_job, memory_pressure, slow_queries, auth_attack, connection_leak, cache_thrashing, node_failover, unknown

Buffer Stats Metrics

Metric	Type	Labels	Description	Example
`betterdb_anomaly_buffer_ready`	gauge	`metric_type`	Buffer ready state (1=ready, 0=warming)	`1`
`betterdb_anomaly_buffer_mean`	gauge	`metric_type`	Rolling mean for anomaly detection	`2450`
`betterdb_anomaly_buffer_stddev`	gauge	`metric_type`	Rolling stddev for anomaly detection	`125.5`

Internal Metrics

BetterDB Monitor application health metrics.

Metric	Type	Labels	Description	Example
`betterdb_polls_total`	counter	-	Total number of poll cycles completed	`123456`
`betterdb_poll_duration_seconds`	histogram	`service`	Duration of poll cycles in seconds	buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

Service Values: Names of polling services (audit, client-analytics, metrics, etc.)

Node.js Process Metrics

Standard process metrics provided by prom-client with betterdb_ prefix.

CPU & Memory

Metric	Type	Description
`betterdb_process_cpu_user_seconds_total`	counter	Total user CPU time spent in seconds
`betterdb_process_cpu_system_seconds_total`	counter	Total system CPU time spent in seconds
`betterdb_process_cpu_seconds_total`	counter	Total user and system CPU time spent in seconds
`betterdb_process_resident_memory_bytes`	gauge	Resident memory size in bytes
`betterdb_process_virtual_memory_bytes`	gauge	Virtual memory size in bytes
`betterdb_process_heap_bytes`	gauge	Process heap size in bytes

File Descriptors

Metric	Type	Description
`betterdb_process_open_fds`	gauge	Number of open file descriptors
`betterdb_process_max_fds`	gauge	Maximum number of open file descriptors

Event Loop

Metric	Type	Description
`betterdb_nodejs_eventloop_lag_seconds`	gauge	Lag of event loop in seconds
`betterdb_nodejs_eventloop_lag_min_seconds`	gauge	Minimum recorded event loop delay
`betterdb_nodejs_eventloop_lag_max_seconds`	gauge	Maximum recorded event loop delay
`betterdb_nodejs_eventloop_lag_mean_seconds`	gauge	Mean of recorded event loop delays
`betterdb_nodejs_eventloop_lag_stddev_seconds`	gauge	Standard deviation of recorded event loop delays
`betterdb_nodejs_eventloop_lag_p50_seconds`	gauge	50th percentile of recorded event loop delays
`betterdb_nodejs_eventloop_lag_p90_seconds`	gauge	90th percentile of recorded event loop delays
`betterdb_nodejs_eventloop_lag_p99_seconds`	gauge	99th percentile of recorded event loop delays

Heap & GC

Metric	Type	Labels	Description
`betterdb_nodejs_heap_size_total_bytes`	gauge	-	Process heap size from Node.js in bytes
`betterdb_nodejs_heap_size_used_bytes`	gauge	-	Process heap size used from Node.js in bytes
`betterdb_nodejs_external_memory_bytes`	gauge	-	Node.js external memory size in bytes
`betterdb_nodejs_heap_space_size_total_bytes`	gauge	`space`	Process heap space size total in bytes
`betterdb_nodejs_heap_space_size_used_bytes`	gauge	`space`	Process heap space size used in bytes
`betterdb_nodejs_heap_space_size_available_bytes`	gauge	`space`	Process heap space size available in bytes
`betterdb_nodejs_gc_duration_seconds`	histogram	`kind`	Garbage collection duration (major, minor, incremental, weakcb)

Scrape Configuration

Basic Prometheus Configuration

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets: ['localhost:3001']
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

Multi-Instance Setup

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets:
        - 'betterdb-prod-1:3001'
        - 'betterdb-prod-2:3001'
        - 'betterdb-staging:3001'
        labels:
          env: 'production'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s

With Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'betterdb'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: betterdb-monitor
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: '${1}:3001'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s

Useful PromQL Queries

Anomaly Detection

# Anomaly detection rate (events per minute)
rate(betterdb_anomaly_events_total[5m]) * 60

# Critical anomalies in last hour
betterdb_anomaly_by_severity{severity="critical"}

# Detection system readiness percentage
sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready) * 100

# Memory pressure incidents in last hour
increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[1h])

# Top metrics causing anomalies
topk(5, betterdb_anomaly_by_metric)

# Unresolved critical anomalies
betterdb_anomaly_events_current{severity="critical"}

CPU Utilization

# Per-second CPU usage (system + user combined)
rate(betterdb_cpu_sys_seconds_total[5m]) + rate(betterdb_cpu_user_seconds_total[5m])

# System vs user CPU breakdown
rate(betterdb_cpu_sys_seconds_total[5m])
rate(betterdb_cpu_user_seconds_total[5m])

Memory & Performance

# Memory usage percentage (if maxmemory is set)
(betterdb_memory_used_bytes / betterdb_memory_max_bytes) * 100

# Memory fragmentation ratio (alert if > 1.5)
betterdb_memory_fragmentation_ratio

# Cache hit rate percentage
(betterdb_keyspace_hits_total / (betterdb_keyspace_hits_total + betterdb_keyspace_misses_total)) * 100

# Operations per second trend
rate(betterdb_commands_processed_total[5m])

# Network throughput (combined input + output)
betterdb_instantaneous_input_kbps + betterdb_instantaneous_output_kbps

Client Analytics

# Connection growth rate
rate(betterdb_connections_received_total[5m])

# Current connection count by user
sum by (user) (betterdb_client_connections_by_user)

# Peak vs current connections
betterdb_client_connections_peak - betterdb_client_connections_current

Slowlog Analysis

# Top 5 slow query patterns
topk(5, betterdb_slowlog_pattern_count)

# Slowest query patterns by average duration
topk(5, betterdb_slowlog_pattern_avg_duration_us)

# Slowlog growth rate
rate(betterdb_slowlog_length[5m])

Cluster Health

# Cluster slot health percentage
(betterdb_cluster_slots_ok / betterdb_cluster_slots_assigned) * 100

# Failed slots alert
betterdb_cluster_slots_fail + betterdb_cluster_slots_pfail

# Replication lag (for replicas)
betterdb_master_last_io_seconds_ago

Application Health

# BetterDB Monitor event loop lag (alert if > 100ms)
betterdb_nodejs_eventloop_lag_p99_seconds > 0.1

# Poll duration 99th percentile
histogram_quantile(0.99, rate(betterdb_poll_duration_seconds_bucket[5m]))

# High cardinality metric check (client names)
count(betterdb_client_connections_by_name)

Alertmanager Rules

The following alert rules are production-ready. See docs/alertmanager-rules.yml for the complete YAML configuration.

Critical Alerts

BetterDBCriticalAnomaly - Fires immediately when a critical anomaly is detected

increase(betterdb_anomaly_events_total{severity="critical"}[5m]) > 0

BetterDBMemoryPressure - Memory pressure pattern detected

increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[10m]) > 0

BetterDBAuthAnomaly - Potential authentication attack

increase(betterdb_correlated_groups_total{pattern="auth_attack"}[5m]) > 0

Warning Alerts

BetterDBWarningSpike - Multiple warning anomalies in short period

increase(betterdb_anomaly_events_total{severity="warning"}[5m]) > 5

BetterDBConnectionLeak - Possible connection leak pattern

increase(betterdb_correlated_groups_total{pattern="connection_leak"}[10m]) > 0
for: 5m

BetterDBTrafficBurst - Traffic burst detected

increase(betterdb_correlated_groups_total{pattern="traffic_burst"}[5m]) > 0

BetterDBUnresolvedCriticalAnomalies - Multiple unresolved critical anomalies

betterdb_anomaly_events_current{severity="critical"} > 3
for: 10m

BetterDBPersistentAnomalies - Persistent anomalies over time

betterdb_anomaly_by_severity{severity!="info"} > 10
for: 30m

Info Alerts

BetterDBAnomalyDetectionWarming - Anomaly detection system warming up

(sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready)) < 1
for: 5m

Grafana Integration

Import Ready-Made Dashboard

Navigate to Grafana → Dashboards → Import
Use BetterDB Monitor dashboard ID: [Coming Soon]
Select your Prometheus datasource
Click Import

Creating Custom Dashboards

Recommended Panels:

Anomaly Overview - Gauge showing unresolved critical anomalies
Anomaly Timeline - Graph of rate(betterdb_anomaly_events_total[5m]) by severity
Pattern Detection - Bar chart of betterdb_correlated_groups_by_pattern
Memory Usage - Graph showing betterdb_memory_used_bytes vs betterdb_memory_max_bytes
Cache Hit Rate - Graph showing cache hit rate percentage
Connection Trends - Graph of betterdb_client_connections_current and peak
Slow Query Patterns - Table showing top patterns from betterdb_slowlog_pattern_*
Buffer Readiness - Heatmap of betterdb_anomaly_buffer_ready by metric type

Example Panel Query (Memory Usage)

{
  "expr": "betterdb_memory_used_bytes",
  "legendFormat": "Used Memory",
  "refId": "A"
},
{
  "expr": "betterdb_memory_max_bytes",
  "legendFormat": "Max Memory",
  "refId": "B"
}

Configuration

Metrics Update Interval

The anomaly detection Prometheus summary is updated every 30 seconds by default. Configure via:

ANOMALY_PROMETHEUS_INTERVAL_MS=30000

Or update at runtime via the /settings API endpoint:

curl -X PUT http://localhost:3001/settings \
  -H "Content-Type: application/json" \
  -d '{"anomalyPrometheusIntervalMs": 15000}'

Cardinality Management

High-cardinality labels can impact Prometheus performance. Monitor these metrics:

betterdb_client_connections_by_name - Scales with unique client names
betterdb_client_connections_by_user - Scales with unique usernames
betterdb_cluster_slot_* - Limited to top 100 slots automatically

If cardinality becomes an issue, consider:

Aggregating client names using relabel_configs in Prometheus
Filtering specific labels using metric_relabel_configs
Reducing retention period for client analytics data

Troubleshooting

Missing Metrics

COMMANDLOG metrics not appearing?

Check Valkey version: Requires Valkey 8.1+
Verify connection: Ensure BetterDB is connected to Valkey (not Redis)

Cluster slot metrics not appearing?

Check Valkey version: Requires Valkey 8.0+
Verify cluster mode: Ensure the instance is in cluster mode

Anomaly metrics showing zeros?

Wait for warmup: Anomaly detection requires 30 samples (30 seconds at 1s poll rate)
Check buffer readiness: Query betterdb_anomaly_buffer_ready

High Scrape Duration

If /prometheus/metrics takes >1s to respond:

Reduce slowlog analysis sample size (default: 128 entries)
Reduce cluster slot stats limit (default: 100 slots)
Increase scrape timeout in Prometheus config
Check if database is responding slowly

Stale Metrics

If metrics appear outdated:

Verify BetterDB Monitor is running: Check betterdb_process_start_time_seconds
Check database connectivity: Review /health endpoint
Verify polling services: Check betterdb_polls_total is incrementing