Prometheus Metrics Reference
Complete reference for all metrics exposed by BetterDB Monitor at the /prometheus/metrics endpoint.
Table of Contents
- Overview
- Metrics Categories
  - ACL Audit Metrics
  - Client Analytics Metrics
  - Slowlog Metrics
  - COMMANDLOG Metrics
  - Vector Index Metrics
  - Commandstats Metrics
  - Inference Latency Metrics
  - Server Info Metrics
  - Memory Metrics
  - Stats Metrics
  - CPU Metrics
  - Replication Metrics
  - Keyspace Metrics
  - Cluster Metrics
  - Anomaly Detection Metrics
  - Internal Metrics
  - Node.js Process Metrics
- Scrape Configuration
- Useful PromQL Queries
- Alertmanager Rules
Overview
BetterDB Monitor exposes Prometheus-compatible metrics at:
GET /prometheus/metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8
All custom metrics are prefixed with betterdb_. Standard Node.js process metrics from prom-client are also included with the same prefix.
Scrape Interval: Recommended 15s
Metrics Update: Metrics are computed on-demand during each scrape
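To make the response format concrete, here is a minimal sketch of reading the text exposition format and keeping only `betterdb_`-prefixed samples. The helper name `parse_betterdb_samples` and the sample payload are illustrative assumptions, not part of BetterDB Monitor. A production consumer would normally use a full Prometheus client parser instead.

```python
import re
from typing import Dict, Tuple

# One sample line in the text format: metric name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_betterdb_samples(text: str, prefix: str = "betterdb_") -> Dict[Tuple[str, str], float]:
    """Return {(metric_name, raw_label_string): value} for prefixed samples."""
    samples: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        if name.startswith(prefix):
            samples[(name, labels or "")] = float(value)
    return samples


# Hypothetical payload shaped like a scrape response.
example = """\
# HELP betterdb_memory_used_bytes Total allocated memory in bytes
# TYPE betterdb_memory_used_bytes gauge
betterdb_memory_used_bytes 1073741824
betterdb_db_keys{db="db0"} 125000
"""
parsed = parse_betterdb_samples(example)
print(parsed[("betterdb_db_keys", '{db="db0"}')])
```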
Metrics Categories
ACL Audit Metrics
Track ACL denied events captured from the monitored Valkey/Redis instance.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_acl_denied | gauge | - | Total ACL denied events captured | 42 |
| betterdb_acl_denied_by_reason | gauge | reason | ACL denied events by reason (auth, command, key, channel) | 15 |
| betterdb_acl_denied_by_user | gauge | username | ACL denied events by username | 8 |
Cardinality Warning: betterdb_acl_denied_by_user cardinality scales with number of unique usernames experiencing failures.
Client Analytics Metrics
Monitor client connection patterns and trends.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_client_connections_current | gauge | - | Current number of client connections | 127 |
| betterdb_client_connections_peak | gauge | - | Peak connections in retention period | 256 |
| betterdb_client_connections_by_name | gauge | client_name | Current connections by client name | 12 |
| betterdb_client_connections_by_user | gauge | user | Current connections by ACL user | 25 |
Cardinality Warning: Label-based metrics scale with unique client names and usernames.
Slowlog Metrics
Analyze slow query patterns aggregated from SLOWLOG data.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_slowlog_length | gauge | - | Current slowlog length | 128 |
| betterdb_slowlog_last_id | gauge | - | ID of last slowlog entry | 12345 |
| betterdb_slowlog_pattern_count | gauge | pattern | Number of slow queries per pattern | 24 |
| betterdb_slowlog_pattern_avg_duration_us | gauge | pattern | Average duration in microseconds per pattern | 1250000 |
| betterdb_slowlog_pattern_percentage | gauge | pattern | Percentage of slow queries per pattern | 18.75 |
Pattern Examples: GET *, HGETALL *, SCAN *
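The per-pattern gauges can be pictured with a small aggregation sketch. The normalization rule used here (command name plus a `*` wildcard for the arguments) is an assumption inferred from the pattern examples above. The function name `summarize_patterns` and the input tuples are hypothetical, and BetterDB's actual grouping logic may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_patterns(entries: List[Tuple[str, int]]) -> Dict[str, Dict[str, float]]:
    """Aggregate (raw command line, duration in microseconds) entries by pattern."""
    durations = defaultdict(list)
    for command_line, duration_us in entries:
        parts = command_line.split()
        if not parts:
            continue
        # Assumed normalization: keep the command name, wildcard the arguments.
        name = parts[0].upper()
        pattern = f"{name} *" if len(parts) > 1 else name
        durations[pattern].append(duration_us)

    total = sum(len(values) for values in durations.values())
    return {
        pattern: {
            "count": len(values),
            "avg_duration_us": sum(values) / len(values),
            "percentage": 100.0 * len(values) / total,
        }
        for pattern, values in durations.items()
    }


summary = summarize_patterns([
    ("GET user:1", 12000),
    ("get user:2", 18000),
    ("HGETALL cart:9", 30000),
    ("SCAN 0", 40000),
])
print(summary["GET *"])
```

Each key of the result corresponds to one `pattern` label value, and the three fields map onto the count, average duration, and percentage gauges.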
COMMANDLOG Metrics (Valkey 8.1+)
Valkey-specific metrics for tracking large request/reply commands.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_commandlog_large_request | gauge | - | Total large request entries | 15 |
| betterdb_commandlog_large_reply | gauge | - | Total large reply entries | 8 |
| betterdb_commandlog_large_request_by_pattern | gauge | pattern | Large request count by command pattern | 5 |
| betterdb_commandlog_large_reply_by_pattern | gauge | pattern | Large reply count by command pattern | 3 |
Availability: Only populated when connected to Valkey 8.1+. Returns no data for Redis or older Valkey versions.
Vector Index Metrics
Per-index health metrics for vector search indexes, populated by VectorSearchService every 30 s. Gauges are emitted once per (connection, index) pair, and stale labels are automatically removed when an index is dropped between polls.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_vector_index_docs | gauge | index | Current document count for a vector index | 30000 |
| betterdb_vector_index_memory_bytes | gauge | index | Current memory usage for a vector index, in bytes | 62914560 |
| betterdb_vector_index_indexing_failures | gauge | index | Cumulative hash_indexing_failures for a vector index | 0 |
| betterdb_vector_index_percent_indexed | gauge | index | Percent of documents indexed (0–100) | 100 |
Availability: Only populated when connected to an instance with the Search module loaded (RediSearch or valkey-search). Returns no data otherwise. See Vector / AI for the feature overview and the REST endpoints that back the monitor UI.
Cardinality Warning: Label cardinality scales with the number of indexes per connection. Typical deployments have single-digit index counts; if you run hundreds of indexes per instance, monitor scrape size accordingly.
Commandstats Metrics
Per-command execution counts and latency sourced from INFO commandstats, populated by CommandstatsPollerService every 60 s. Gauges are emitted once per (connection, command) pair, and stale labels are automatically removed when a command disappears between polls (e.g., after CONFIG RESETSTAT).
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_commandstats_calls_total | gauge | command | Cumulative number of times a command has been executed (calls from INFO commandstats) | 1523 |
| betterdb_commandstats_latency_us | gauge | command | Rolling average command latency in microseconds (usec_per_call from INFO commandstats) | 29700 |
calls_total is published as a gauge (not a counter) because the source value is the absolute cumulative count reported by the server, not an increment computed in-process. rate() still works on it: PromQL treats any decrease in a series as a counter reset regardless of the declared metric type, so queries behave correctly across both incremental polls and CONFIG RESETSTAT-driven resets.
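The reset behaviour can be illustrated with a simplified sketch of how an increase is accumulated over a window of samples. This is not Prometheus's implementation (which also extrapolates to the window boundaries). The function names and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative value)


def increase_with_resets(samples: List[Sample]) -> float:
    """Total increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The value went down, so the server-side count restarted from zero
            # and everything up to `current` happened after the reset.
            total += current
    return total


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical calls values polled every 60 s, with a reset before t=120.
samples = [(0, 1500), (60, 1560), (120, 30), (180, 90)]
print(increase_with_resets(samples), per_second_rate(samples))
```

In this example the drop from 1560 to 30 contributes 30 calls rather than a negative delta, which is why the computed rate stays meaningful across the reset.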
Cardinality Warning: Label cardinality scales with the number of distinct commands executed per connection. A typical Valkey workload exposes a few dozen; modules like RediSearch add another handful. If your workload uses a very large module surface, monitor scrape size accordingly.
Inference Latency Metrics
Percentile latency for inference-shaped buckets, sourced from per-entry duration tables (command_log_entries on Valkey 8.1+, slowlog_entries elsewhere). Buckets are FT.SEARCH:<index-name> per configured index, plus aggregate read (GET/MGET) and write (SET/HSET family). Populated by the InferenceLatencyService evaluation loop; stale labels are removed when a bucket disappears.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_inference_bucket_p50_us | gauge | bucket | p50 latency in microseconds for an inference bucket | 4200 |
| betterdb_inference_bucket_p95_us | gauge | bucket | p95 latency in microseconds for an inference bucket | 18500 |
| betterdb_inference_bucket_p99_us | gauge | bucket | p99 latency in microseconds for an inference bucket | 31000 |
| betterdb_inference_unhealthy | gauge | bucket | Whether a bucket is unhealthy (p50 > 10 ms for FT.SEARCH:*): 1 unhealthy, 0 healthy | 0 |
| betterdb_inference_sla_breach | gauge | index | Whether the configured per-index p99 SLA is currently breached: 1 breached, 0 ok | 0 |
Threshold-gating bias: the source tables only store entries slower than the configured threshold directive (commandlog-execution-slower-than or slowlog-log-slower-than), so percentiles skew toward the tail. The /inference-latency/profile HTTP response exposes the active directive and its value so consumers can qualify the reported percentiles.
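A small worked example shows how much the gating can shift a percentile. The durations, the threshold, and the nearest-rank percentile method are all illustrative assumptions. The source does not state which percentile method InferenceLatencyService uses.

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile; q is in (0, 100] and values is non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical durations in microseconds for ten executions of one command.
all_durations = [500, 800, 1200, 2000, 4000, 9000, 12000, 15000, 20000, 31000]
threshold_us = 10000  # stand-in for the configured slower-than value

# Only entries slower than the threshold ever reach the log tables.
logged = [d for d in all_durations if d > threshold_us]

true_p50 = percentile(all_durations, 50)
logged_p50 = percentile(logged, 50)
print(true_p50, logged_p50)
```

Here the median over all executions is 4000 µs, while the median over the logged subset is 15000 µs, so the exported value describes the slow tail rather than typical latency.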
Cardinality Warning: FT.SEARCH buckets scale with the number of vector indexes per connection. inference_sla_breach only emits for indexes with an active SLA configured (Pro tier). Aggregate read / write buckets are constant cardinality.
Server Info Metrics
Basic server identification and uptime.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_uptime_in_seconds | gauge | - | Server uptime in seconds | 864000 |
| betterdb_instance_info | gauge | version, role, os | Instance information (always 1) | 1 |
Label Example: version="8.0.1", role="master", os="Linux 5.15.0"
Memory Metrics
Detailed memory usage and fragmentation tracking.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_memory_used_bytes | gauge | - | Total allocated memory in bytes | 1073741824 |
| betterdb_memory_used_rss_bytes | gauge | - | RSS memory usage in bytes | 1200000000 |
| betterdb_memory_used_peak_bytes | gauge | - | Peak memory usage in bytes | 1500000000 |
| betterdb_memory_max_bytes | gauge | - | Maximum memory limit in bytes (0 if unlimited) | 2147483648 |
| betterdb_memory_fragmentation_ratio | gauge | - | Memory fragmentation ratio | 1.15 |
| betterdb_memory_fragmentation_bytes | gauge | - | Memory fragmentation in bytes | 126000000 |
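As a rough cross-check of how the fragmentation figures relate to the other memory gauges, the sketch below assumes the common approximation of ratio as RSS divided by allocated memory and bytes as RSS minus allocated memory. The server may derive its reported values from allocator statistics instead, so the exported gauges can differ from this estimate. The function name is hypothetical.

```python
import math
from typing import Tuple


def estimate_fragmentation(used_bytes: int, rss_bytes: int) -> Tuple[float, int]:
    """Approximate (ratio, bytes) from allocated and resident memory."""
    ratio = rss_bytes / used_bytes if used_bytes > 0 else math.nan
    return ratio, rss_bytes - used_bytes


# Example values taken from the used and RSS rows of the table above.
ratio, frag_bytes = estimate_fragmentation(1_073_741_824, 1_200_000_000)
print(round(ratio, 2), frag_bytes)
```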
Stats Metrics
Operational statistics and throughput.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_connections_received_total | gauge | - | Total connections received | 45678 |
| betterdb_commands_processed_total | gauge | - | Total commands processed | 12456789 |
| betterdb_instantaneous_ops_per_sec | gauge | - | Current operations per second | 2500 |
| betterdb_instantaneous_input_kbps | gauge | - | Current input kilobytes per second | 125.5 |
| betterdb_instantaneous_output_kbps | gauge | - | Current output kilobytes per second | 856.3 |
| betterdb_keyspace_hits_total | gauge | - | Total keyspace hits | 9876543 |
| betterdb_keyspace_misses_total | gauge | - | Total keyspace misses | 234567 |
| betterdb_evicted_keys_total | gauge | - | Total evicted keys | 1234 |
| betterdb_expired_keys_total | gauge | - | Total expired keys | 56789 |
| betterdb_pubsub_channels | gauge | - | Number of pub/sub channels | 12 |
| betterdb_pubsub_patterns | gauge | - | Number of pub/sub patterns | 3 |
CPU Metrics
Server CPU consumption from the Valkey/Redis INFO CPU section.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_cpu_sys_seconds_total | gauge | connection | Cumulative system CPU time consumed by the server in seconds | 123.45 |
| betterdb_cpu_user_seconds_total | gauge | connection | Cumulative user CPU time consumed by the server in seconds | 456.78 |
System vs User CPU
cpu_sys_seconds_total tracks time the CPU spent in kernel space on behalf of the Valkey process - network I/O syscalls, memory allocation, and other OS-level operations. cpu_user_seconds_total tracks time spent executing Valkey’s own code in userspace - command processing, data structure operations, Lua scripts, and so on.
For a lightly loaded instance, system CPU is typically higher than user because most work is network I/O. A spike in user CPU points to CPU-intensive commands (large SORT operations, complex Lua scripts, big key scans). A spike in system CPU points to network or memory pressure.
Note: These are cumulative counters exposed as gauges. Use rate() in PromQL to compute per-second CPU usage.
Replication Metrics
Replication status and offset tracking.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_connected_slaves | gauge | - | Number of connected replicas | 2 |
| betterdb_replication_offset | gauge | - | Replication offset | 123456789 |
| betterdb_master_link_up | gauge | - | 1 if link to master is up (replica only) | 1 |
| betterdb_master_last_io_seconds_ago | gauge | - | Seconds since last I/O with master (replica only) | 2 |
Keyspace Metrics
Per-database key statistics.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_db_keys | gauge | db | Total keys in database | 125000 |
| betterdb_db_keys_expiring | gauge | db | Keys with expiration in database | 45000 |
| betterdb_db_avg_ttl_seconds | gauge | db | Average TTL in seconds | 3600 |
Label Example: db="db0", db="db1"
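For orientation, the sketch below shows how one line of the server's INFO keyspace section could map onto these three gauges. It assumes the usual Valkey/Redis line format, in which avg_ttl is reported in milliseconds, and it assumes the exporter converts that value to seconds. The function name and output keys are hypothetical.

```python
from typing import Dict, Union


def parse_keyspace_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse a line such as 'db0:keys=125000,expires=45000,avg_ttl=3600000'."""
    db, _, fields = line.strip().partition(":")
    raw = dict(item.split("=", 1) for item in fields.split(","))
    return {
        "db": db,                                   # becomes the db label
        "keys": int(raw["keys"]),
        "keys_expiring": int(raw["expires"]),
        # INFO reports avg_ttl in milliseconds; the gauge is in seconds.
        "avg_ttl_seconds": int(raw["avg_ttl"]) / 1000,
    }


row = parse_keyspace_line("db0:keys=125000,expires=45000,avg_ttl=3600000")
print(row)
```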
Cluster Metrics
Cluster mode health and slot distribution.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_cluster_enabled | gauge | - | 1 if cluster mode is enabled | 1 |
| betterdb_cluster_known_nodes | gauge | - | Number of known cluster nodes | 6 |
| betterdb_cluster_size | gauge | - | Number of master nodes in cluster | 3 |
| betterdb_cluster_slots_assigned | gauge | - | Number of assigned slots | 16384 |
| betterdb_cluster_slots_ok | gauge | - | Number of slots in OK state | 16384 |
| betterdb_cluster_slots_fail | gauge | - | Number of slots in FAIL state | 0 |
| betterdb_cluster_slots_pfail | gauge | - | Number of slots in PFAIL state | 0 |
Cluster Slot Metrics (Valkey 8.0+)
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_cluster_slot_keys | gauge | slot | Keys in cluster slot | 512 |
| betterdb_cluster_slot_expires | gauge | slot | Expiring keys in cluster slot | 128 |
| betterdb_cluster_slot_reads_total | gauge | slot | Total reads for cluster slot | 45678 |
| betterdb_cluster_slot_writes_total | gauge | slot | Total writes for cluster slot | 12345 |
Availability: Only populated when connected to Valkey 8.0+ cluster. Limited to top 100 slots by key count.
Anomaly Detection Metrics
Real-time anomaly detection system metrics.
Event Metrics
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_anomaly_events_total | counter | severity, metric_type, anomaly_type | Total anomaly events detected | 42 |
| betterdb_anomaly_events_current | gauge | severity | Unresolved anomalies by severity | 3 |
| betterdb_anomaly_by_severity | gauge | severity | Anomalies in last hour by severity | 12 |
| betterdb_anomaly_by_metric | gauge | metric_type | Anomalies in last hour by metric | 8 |
Label Values:
- severity: info, warning, critical
- metric_type: connections, ops_per_sec, memory_used, input_kbps, output_kbps, slowlog_last_id, acl_denied, evicted_keys, blocked_clients, keyspace_misses, fragmentation_ratio, cpu_utilization, replication_role
- anomaly_type: spike, drop
Correlation Metrics
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_correlated_groups_total | counter | pattern, severity | Total correlated anomaly groups | 15 |
| betterdb_correlated_groups_by_severity | gauge | severity | Groups in last hour by severity | 8 |
| betterdb_correlated_groups_by_pattern | gauge | pattern | Groups in last hour by pattern | 5 |
Pattern Values: traffic_burst, batch_job, memory_pressure, slow_queries, auth_attack, connection_leak, cache_thrashing, node_failover, unknown
Buffer Stats Metrics
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_anomaly_buffer_ready | gauge | metric_type | Buffer ready state (1=ready, 0=warming) | 1 |
| betterdb_anomaly_buffer_mean | gauge | metric_type | Rolling mean for anomaly detection | 2450 |
| betterdb_anomaly_buffer_stddev | gauge | metric_type | Rolling stddev for anomaly detection | 125.5 |
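The relationship between the ready, mean, and stddev gauges and the spike/drop anomaly types can be sketched with a rolling z-score check. This is only an illustration: the z-score rule, the threshold of 3.0, the window size of 300, and the class name are assumptions. The 30-sample warm-up is the figure this document gives for anomaly detection. BetterDB's actual detector may work differently.

```python
import statistics
from collections import deque
from typing import Deque, Optional


class RollingBuffer:
    """Rolling window that stays in a warming state until enough samples arrive."""

    def __init__(self, min_samples: int = 30, max_samples: int = 300,
                 z_threshold: float = 3.0) -> None:
        self.min_samples = min_samples      # warm-up size before detection starts
        self.z_threshold = z_threshold      # hypothetical sensitivity
        self.values: Deque[float] = deque(maxlen=max_samples)

    @property
    def ready(self) -> bool:
        return len(self.values) >= self.min_samples

    def classify(self, value: float) -> Optional[str]:
        """Return 'spike', 'drop', or None, then record the value."""
        verdict = None
        if self.ready:
            mean = statistics.fmean(self.values)
            stddev = statistics.pstdev(self.values)
            if stddev > 0:
                z_score = (value - mean) / stddev
                if z_score > self.z_threshold:
                    verdict = "spike"
                elif z_score < -self.z_threshold:
                    verdict = "drop"
        self.values.append(value)
        return verdict


buffer = RollingBuffer()
for i in range(30):
    buffer.classify(100.0 if i % 2 == 0 else 102.0)  # warm-up samples
verdict = buffer.classify(150.0)
print(buffer.ready, verdict)
```

While the buffer is warming, `classify` never reports an anomaly, which mirrors why anomaly metrics can read zero until the ready gauge is 1.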
Internal Metrics
BetterDB Monitor application health metrics.
| Metric | Type | Labels | Description | Example |
|---|---|---|---|---|
| betterdb_polls_total | counter | - | Total number of poll cycles completed | 123456 |
| betterdb_poll_duration_seconds | histogram | service | Duration of poll cycles in seconds | buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
Service Values: Names of polling services (audit, client-analytics, metrics, etc.)
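Because poll duration is exported as a histogram, percentiles are estimated from cumulative bucket counts rather than read directly. The sketch below shows the general linear-interpolation approach used for such estimates, with the bucket bounds listed above and hypothetical counts. It is a simplified illustration, not Prometheus's exact implementation, and the function name is an assumption.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile for q in (0, 1] from (upper_bound, cumulative_count)
    pairs sorted by bound, where the last bound is +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    target = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= target:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (target - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the documented bucket bounds.
poll_buckets = [
    (0.01, 10), (0.05, 40), (0.1, 70), (0.25, 90), (0.5, 98),
    (1, 100), (2.5, 100), (5, 100), (10, 100), (math.inf, 100),
]
p99 = estimate_quantile(0.99, poll_buckets)
print(p99)
```

The estimate can never be more precise than the bucket boundaries, so a p99 that lands between 0.5 and 1 second is only known to lie somewhere in that range.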
Node.js Process Metrics
Standard process metrics provided by prom-client with betterdb_ prefix.
CPU & Memory
| Metric | Type | Description |
|---|---|---|
| betterdb_process_cpu_user_seconds_total | counter | Total user CPU time spent in seconds |
| betterdb_process_cpu_system_seconds_total | counter | Total system CPU time spent in seconds |
| betterdb_process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| betterdb_process_resident_memory_bytes | gauge | Resident memory size in bytes |
| betterdb_process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| betterdb_process_heap_bytes | gauge | Process heap size in bytes |
File Descriptors
| Metric | Type | Description |
|---|---|---|
| betterdb_process_open_fds | gauge | Number of open file descriptors |
| betterdb_process_max_fds | gauge | Maximum number of open file descriptors |
Event Loop
| Metric | Type | Description |
|---|---|---|
| betterdb_nodejs_eventloop_lag_seconds | gauge | Lag of event loop in seconds |
| betterdb_nodejs_eventloop_lag_min_seconds | gauge | Minimum recorded event loop delay |
| betterdb_nodejs_eventloop_lag_max_seconds | gauge | Maximum recorded event loop delay |
| betterdb_nodejs_eventloop_lag_mean_seconds | gauge | Mean of recorded event loop delays |
| betterdb_nodejs_eventloop_lag_stddev_seconds | gauge | Standard deviation of recorded event loop delays |
| betterdb_nodejs_eventloop_lag_p50_seconds | gauge | 50th percentile of recorded event loop delays |
| betterdb_nodejs_eventloop_lag_p90_seconds | gauge | 90th percentile of recorded event loop delays |
| betterdb_nodejs_eventloop_lag_p99_seconds | gauge | 99th percentile of recorded event loop delays |
Heap & GC
| Metric | Type | Labels | Description |
|---|---|---|---|
| betterdb_nodejs_heap_size_total_bytes | gauge | - | Process heap size from Node.js in bytes |
| betterdb_nodejs_heap_size_used_bytes | gauge | - | Process heap size used from Node.js in bytes |
| betterdb_nodejs_external_memory_bytes | gauge | - | Node.js external memory size in bytes |
| betterdb_nodejs_heap_space_size_total_bytes | gauge | space | Process heap space size total in bytes |
| betterdb_nodejs_heap_space_size_used_bytes | gauge | space | Process heap space size used in bytes |
| betterdb_nodejs_heap_space_size_available_bytes | gauge | space | Process heap space size available in bytes |
| betterdb_nodejs_gc_duration_seconds | histogram | kind | Garbage collection duration (major, minor, incremental, weakcb) |
Scrape Configuration
Basic Prometheus Configuration
scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets: ['localhost:3001']
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s
Multi-Instance Setup
scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      # One target group per environment so each target gets the right env label
      - targets:
          - 'betterdb-prod-1:3001'
          - 'betterdb-prod-2:3001'
        labels:
          env: 'production'
      - targets:
          - 'betterdb-staging:3001'
        labels:
          env: 'staging'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
With Service Discovery (Kubernetes)
scrape_configs:
  - job_name: 'betterdb'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: betterdb-monitor
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: '${1}:3001'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
Useful PromQL Queries
Anomaly Detection
# Anomaly detection rate (events per minute)
rate(betterdb_anomaly_events_total[5m]) * 60
# Critical anomalies in last hour
betterdb_anomaly_by_severity{severity="critical"}
# Detection system readiness percentage
sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready) * 100
# Memory pressure incidents in last hour
increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[1h])
# Top metrics causing anomalies
topk(5, betterdb_anomaly_by_metric)
# Unresolved critical anomalies
betterdb_anomaly_events_current{severity="critical"}
CPU Utilization
# Per-second CPU usage (system + user combined)
rate(betterdb_cpu_sys_seconds_total[5m]) + rate(betterdb_cpu_user_seconds_total[5m])
# System vs user CPU breakdown
rate(betterdb_cpu_sys_seconds_total[5m])
rate(betterdb_cpu_user_seconds_total[5m])
Memory & Performance
# Memory usage percentage (returns nothing when maxmemory is unlimited, i.e. 0)
(betterdb_memory_used_bytes / (betterdb_memory_max_bytes > 0)) * 100
# Memory fragmentation ratio (alert if > 1.5)
betterdb_memory_fragmentation_ratio
# Cache hit rate percentage
(betterdb_keyspace_hits_total / (betterdb_keyspace_hits_total + betterdb_keyspace_misses_total)) * 100
# Operations per second trend
rate(betterdb_commands_processed_total[5m])
# Network throughput (combined input + output)
betterdb_instantaneous_input_kbps + betterdb_instantaneous_output_kbps
Client Analytics
# Connection growth rate
rate(betterdb_connections_received_total[5m])
# Current connection count by user
sum by (user) (betterdb_client_connections_by_user)
# Peak vs current connections
betterdb_client_connections_peak - betterdb_client_connections_current
Slowlog Analysis
# Top 5 slow query patterns
topk(5, betterdb_slowlog_pattern_count)
# Slowest query patterns by average duration
topk(5, betterdb_slowlog_pattern_avg_duration_us)
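# Slowlog growth rate (new entries per second; the length gauge stops growing once the log is full, so use the entry ID)
rate(betterdb_slowlog_last_id[5m])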
Cluster Health
# Cluster slot health percentage
(betterdb_cluster_slots_ok / betterdb_cluster_slots_assigned) * 100
# Failed slots alert
betterdb_cluster_slots_fail + betterdb_cluster_slots_pfail
# Replication lag (for replicas)
betterdb_master_last_io_seconds_ago
Application Health
# BetterDB Monitor event loop lag (alert if > 100ms)
betterdb_nodejs_eventloop_lag_p99_seconds > 0.1
# Poll duration 99th percentile
histogram_quantile(0.99, rate(betterdb_poll_duration_seconds_bucket[5m]))
# High cardinality metric check (client names)
count(betterdb_client_connections_by_name)
Alertmanager Rules
The following alert rules are production-ready. See docs/alertmanager-rules.yml for the complete YAML configuration.
Critical Alerts
BetterDBCriticalAnomaly - Fires immediately when a critical anomaly is detected
increase(betterdb_anomaly_events_total{severity="critical"}[5m]) > 0
BetterDBMemoryPressure - Memory pressure pattern detected
increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[10m]) > 0
BetterDBAuthAnomaly - Potential authentication attack
increase(betterdb_correlated_groups_total{pattern="auth_attack"}[5m]) > 0
Warning Alerts
BetterDBWarningSpike - Multiple warning anomalies in short period
increase(betterdb_anomaly_events_total{severity="warning"}[5m]) > 5
BetterDBConnectionLeak - Possible connection leak pattern
increase(betterdb_correlated_groups_total{pattern="connection_leak"}[10m]) > 0
for: 5m
BetterDBTrafficBurst - Traffic burst detected
increase(betterdb_correlated_groups_total{pattern="traffic_burst"}[5m]) > 0
BetterDBUnresolvedCriticalAnomalies - Multiple unresolved critical anomalies
betterdb_anomaly_events_current{severity="critical"} > 3
for: 10m
BetterDBPersistentAnomalies - Persistent anomalies over time
betterdb_anomaly_by_severity{severity!="info"} > 10
for: 30m
Info Alerts
BetterDBAnomalyDetectionWarming - Anomaly detection system warming up
(sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready)) < 1
for: 5m
Grafana Integration
Import Ready-Made Dashboard
- Navigate to Grafana → Dashboards → Import
- Use BetterDB Monitor dashboard ID: [Coming Soon]
- Select your Prometheus datasource
- Click Import
Creating Custom Dashboards
Recommended Panels:
- Anomaly Overview - Gauge showing unresolved critical anomalies
- Anomaly Timeline - Graph of rate(betterdb_anomaly_events_total[5m]) by severity
- Pattern Detection - Bar chart of betterdb_correlated_groups_by_pattern
- Memory Usage - Graph showing betterdb_memory_used_bytes vs betterdb_memory_max_bytes
- Cache Hit Rate - Graph showing cache hit rate percentage
- Connection Trends - Graph of betterdb_client_connections_current and peak
- Slow Query Patterns - Table showing top patterns from betterdb_slowlog_pattern_*
- Buffer Readiness - Heatmap of betterdb_anomaly_buffer_ready by metric type
Example Panel Query (Memory Usage)
[
  {
    "expr": "betterdb_memory_used_bytes",
    "legendFormat": "Used Memory",
    "refId": "A"
  },
  {
    "expr": "betterdb_memory_max_bytes",
    "legendFormat": "Max Memory",
    "refId": "B"
  }
]
Configuration
Metrics Update Interval
The anomaly detection Prometheus summary is updated every 30 seconds by default. Configure via:
ANOMALY_PROMETHEUS_INTERVAL_MS=30000
Or update at runtime via the /settings API endpoint:
curl -X PUT http://localhost:3001/settings \
-H "Content-Type: application/json" \
-d '{"anomalyPrometheusIntervalMs": 15000}'
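The same runtime update can be issued from a script. The sketch below builds the equivalent PUT request with the Python standard library. The helper name `build_settings_request` is hypothetical, and the request is constructed without being sent so that it can be inspected first.

```python
import json
import urllib.request


def build_settings_request(base_url: str, interval_ms: int) -> urllib.request.Request:
    """Build a PUT /settings request that sets anomalyPrometheusIntervalMs."""
    body = json.dumps({"anomalyPrometheusIntervalMs": interval_ms}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_settings_request("http://localhost:3001", 15000)
print(request.get_method(), request.full_url)

# To actually send it against a running instance:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```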
Cardinality Management
High-cardinality labels can impact Prometheus performance. Monitor these metrics:
- betterdb_client_connections_by_name - Scales with unique client names
- betterdb_client_connections_by_user - Scales with unique usernames
- betterdb_cluster_slot_* - Limited to top 100 slots automatically
If cardinality becomes an issue, consider:
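- Collapsing client names into coarser groups using metric_relabel_configs in Prometheus (relabel_configs only rewrites target labels before the scrape, not the labels on scraped samples)
- Dropping specific series or labels using metric_relabel_configs
- Reducing retention period for client analytics data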
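Before tuning anything, it helps to see which metric names contribute the most series. The following sketch counts sample lines per metric name in a scrape payload. The function name and the sample payload are illustrative assumptions.

```python
import re
from collections import Counter

# Leading metric name of a sample line in the text exposition format.
SAMPLE_NAME_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)')


def series_per_metric(text: str) -> Counter:
    """Count how many series (sample lines) each metric name exposes."""
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_NAME_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical payload: three client names, one unlabelled gauge.
payload = """\
# TYPE betterdb_client_connections_by_name gauge
betterdb_client_connections_by_name{client_name="api"} 12
betterdb_client_connections_by_name{client_name="worker"} 4
betterdb_client_connections_by_name{client_name="cron"} 1
betterdb_client_connections_current 17
"""
counts = series_per_metric(payload)
print(counts.most_common(1))
```

Running this against a saved scrape response gives a quick ranking of the label-heavy metrics, similar in spirit to the count() query shown earlier for client names.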
Troubleshooting
Missing Metrics
COMMANDLOG metrics not appearing?
- Check Valkey version: Requires Valkey 8.1+
- Verify connection: Ensure BetterDB is connected to Valkey (not Redis)
Cluster slot metrics not appearing?
- Check Valkey version: Requires Valkey 8.0+
- Verify cluster mode: Ensure the instance is in cluster mode
Anomaly metrics showing zeros?
- Wait for warmup: Anomaly detection requires 30 samples (30 seconds at 1s poll rate)
- Check buffer readiness: Query betterdb_anomaly_buffer_ready
High Scrape Duration
If /prometheus/metrics takes >1s to respond:
- Reduce slowlog analysis sample size (default: 128 entries)
- Reduce cluster slot stats limit (default: 100 slots)
- Increase scrape timeout in Prometheus config
- Check if database is responding slowly
Stale Metrics
If metrics appear outdated:
- Verify BetterDB Monitor is running: Check betterdb_process_start_time_seconds
- Check database connectivity: Review /health endpoint
- Verify polling services: Check betterdb_polls_total is incrementing