Prometheus Metrics Reference

Complete reference for all metrics exposed by BetterDB Monitor at the /prometheus/metrics endpoint.

Table of Contents

Overview

BetterDB Monitor exposes Prometheus-compatible metrics at:

GET /prometheus/metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8

All custom metrics are prefixed with betterdb_. Standard Node.js process metrics from prom-client are also included with the same prefix.

Scrape Interval: Recommended 15s Metrics Update: Metrics are computed on-demand during each scrape

Metrics Categories

ACL Audit Metrics

Track ACL denied events captured from the monitored Valkey/Redis instance.

Metric Type Labels Description Example
betterdb_acl_denied gauge - Total ACL denied events captured 42
betterdb_acl_denied_by_reason gauge reason ACL denied events by reason (auth, command, key, channel) 15
betterdb_acl_denied_by_user gauge username ACL denied events by username 8

Cardinality Warning: betterdb_acl_denied_by_user cardinality scales with number of unique usernames experiencing failures.

Client Analytics Metrics

Monitor client connection patterns and trends.

Metric Type Labels Description Example
betterdb_client_connections_current gauge - Current number of client connections 127
betterdb_client_connections_peak gauge - Peak connections in retention period 256
betterdb_client_connections_by_name gauge client_name Current connections by client name 12
betterdb_client_connections_by_user gauge user Current connections by ACL user 25

Cardinality Warning: Label-based metrics scale with unique client names and usernames.

Slowlog Metrics

Analyze slow query patterns aggregated from SLOWLOG data.

Metric Type Labels Description Example
betterdb_slowlog_length gauge - Current slowlog length 128
betterdb_slowlog_last_id gauge - ID of last slowlog entry 12345
betterdb_slowlog_pattern_count gauge pattern Number of slow queries per pattern 24
betterdb_slowlog_pattern_avg_duration_us gauge pattern Average duration in microseconds per pattern 1250000
betterdb_slowlog_pattern_percentage gauge pattern Percentage of slow queries per pattern 18.75

Pattern Examples: GET *, HGETALL *, SCAN *

COMMANDLOG Metrics (Valkey 8.1+)

Valkey-specific metrics for tracking large request/reply commands.

Metric Type Labels Description Example
betterdb_commandlog_large_request gauge - Total large request entries 15
betterdb_commandlog_large_reply gauge - Total large reply entries 8
betterdb_commandlog_large_request_by_pattern gauge pattern Large request count by command pattern 5
betterdb_commandlog_large_reply_by_pattern gauge pattern Large reply count by command pattern 3

Availability: Only populated when connected to Valkey 8.1+. Returns no data for Redis or older Valkey versions.

Server Info Metrics

Basic server identification and uptime.

Metric Type Labels Description Example
betterdb_uptime_in_seconds gauge - Server uptime in seconds 864000
betterdb_instance_info gauge version, role, os Instance information (always 1) 1

Label Example: version="8.0.1", role="master", os="Linux 5.15.0"

Memory Metrics

Detailed memory usage and fragmentation tracking.

Metric Type Labels Description Example
betterdb_memory_used_bytes gauge - Total allocated memory in bytes 1073741824
betterdb_memory_used_rss_bytes gauge - RSS memory usage in bytes 1200000000
betterdb_memory_used_peak_bytes gauge - Peak memory usage in bytes 1500000000
betterdb_memory_max_bytes gauge - Maximum memory limit in bytes (0 if unlimited) 2147483648
betterdb_memory_fragmentation_ratio gauge - Memory fragmentation ratio 1.15
betterdb_memory_fragmentation_bytes gauge - Memory fragmentation in bytes 126000000

Stats Metrics

Operational statistics and throughput.

Metric Type Labels Description Example
betterdb_connections_received_total gauge - Total connections received 45678
betterdb_commands_processed_total gauge - Total commands processed 12456789
betterdb_instantaneous_ops_per_sec gauge - Current operations per second 2500
betterdb_instantaneous_input_kbps gauge - Current input kilobytes per second 125.5
betterdb_instantaneous_output_kbps gauge - Current output kilobytes per second 856.3
betterdb_keyspace_hits_total gauge - Total keyspace hits 9876543
betterdb_keyspace_misses_total gauge - Total keyspace misses 234567
betterdb_evicted_keys_total gauge - Total evicted keys 1234
betterdb_expired_keys_total gauge - Total expired keys 56789
betterdb_pubsub_channels gauge - Number of pub/sub channels 12
betterdb_pubsub_patterns gauge - Number of pub/sub patterns 3

Replication Metrics

Replication status and offset tracking.

Metric Type Labels Description Example
betterdb_connected_slaves gauge - Number of connected replicas 2
betterdb_replication_offset gauge - Replication offset 123456789
betterdb_master_link_up gauge - 1 if link to master is up (replica only) 1
betterdb_master_last_io_seconds_ago gauge - Seconds since last I/O with master (replica only) 2

Keyspace Metrics

Per-database key statistics.

Metric Type Labels Description Example
betterdb_db_keys gauge db Total keys in database 125000
betterdb_db_keys_expiring gauge db Keys with expiration in database 45000
betterdb_db_avg_ttl_seconds gauge db Average TTL in seconds 3600

Label Example: db="db0", db="db1"

Cluster Metrics

Cluster mode health and slot distribution.

Metric Type Labels Description Example
betterdb_cluster_enabled gauge - 1 if cluster mode is enabled 1
betterdb_cluster_known_nodes gauge - Number of known cluster nodes 6
betterdb_cluster_size gauge - Number of master nodes in cluster 3
betterdb_cluster_slots_assigned gauge - Number of assigned slots 16384
betterdb_cluster_slots_ok gauge - Number of slots in OK state 16384
betterdb_cluster_slots_fail gauge - Number of slots in FAIL state 0
betterdb_cluster_slots_pfail gauge - Number of slots in PFAIL state 0

Cluster Slot Metrics (Valkey 8.0+)

Metric Type Labels Description Example
betterdb_cluster_slot_keys gauge slot Keys in cluster slot 512
betterdb_cluster_slot_expires gauge slot Expiring keys in cluster slot 128
betterdb_cluster_slot_reads_total gauge slot Total reads for cluster slot 45678
betterdb_cluster_slot_writes_total gauge slot Total writes for cluster slot 12345

Availability: Only populated when connected to Valkey 8.0+ cluster. Limited to top 100 slots by key count.

Anomaly Detection Metrics

Real-time anomaly detection system metrics.

Event Metrics

Metric Type Labels Description Example
betterdb_anomaly_events_total counter severity, metric_type, anomaly_type Total anomaly events detected 42
betterdb_anomaly_events_current gauge severity Unresolved anomalies by severity 3
betterdb_anomaly_by_severity gauge severity Anomalies in last hour by severity 12
betterdb_anomaly_by_metric gauge metric_type Anomalies in last hour by metric 8

Label Values:

  • severity: info, warning, critical
  • metric_type: connections, ops_per_sec, memory_used, input_kbps, output_kbps, slowlog_count, acl_denied, evicted_keys, blocked_clients, keyspace_misses, fragmentation_ratio
  • anomaly_type: spike, drop

Correlation Metrics

Metric Type Labels Description Example
betterdb_correlated_groups_total counter pattern, severity Total correlated anomaly groups 15
betterdb_correlated_groups_by_severity gauge severity Groups in last hour by severity 8
betterdb_correlated_groups_by_pattern gauge pattern Groups in last hour by pattern 5

Pattern Values: traffic_burst, batch_job, memory_pressure, slow_queries, auth_attack, connection_leak, cache_thrashing, unknown

Buffer Stats Metrics

Metric Type Labels Description Example
betterdb_anomaly_buffer_ready gauge metric_type Buffer ready state (1=ready, 0=warming) 1
betterdb_anomaly_buffer_mean gauge metric_type Rolling mean for anomaly detection 2450
betterdb_anomaly_buffer_stddev gauge metric_type Rolling stddev for anomaly detection 125.5

Internal Metrics

BetterDB Monitor application health metrics.

Metric Type Labels Description Example
betterdb_polls_total counter - Total number of poll cycles completed 123456
betterdb_poll_duration_seconds histogram service Duration of poll cycles in seconds buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

Service Values: Names of polling services (audit, client-analytics, metrics, etc.)

Node.js Process Metrics

Standard process metrics provided by prom-client with betterdb_ prefix.

CPU & Memory

Metric Type Description
betterdb_process_cpu_user_seconds_total counter Total user CPU time spent in seconds
betterdb_process_cpu_system_seconds_total counter Total system CPU time spent in seconds
betterdb_process_cpu_seconds_total counter Total user and system CPU time spent in seconds
betterdb_process_resident_memory_bytes gauge Resident memory size in bytes
betterdb_process_virtual_memory_bytes gauge Virtual memory size in bytes
betterdb_process_heap_bytes gauge Process heap size in bytes

File Descriptors

Metric Type Description
betterdb_process_open_fds gauge Number of open file descriptors
betterdb_process_max_fds gauge Maximum number of open file descriptors

Event Loop

Metric Type Description
betterdb_nodejs_eventloop_lag_seconds gauge Lag of event loop in seconds
betterdb_nodejs_eventloop_lag_min_seconds gauge Minimum recorded event loop delay
betterdb_nodejs_eventloop_lag_max_seconds gauge Maximum recorded event loop delay
betterdb_nodejs_eventloop_lag_mean_seconds gauge Mean of recorded event loop delays
betterdb_nodejs_eventloop_lag_stddev_seconds gauge Standard deviation of recorded event loop delays
betterdb_nodejs_eventloop_lag_p50_seconds gauge 50th percentile of recorded event loop delays
betterdb_nodejs_eventloop_lag_p90_seconds gauge 90th percentile of recorded event loop delays
betterdb_nodejs_eventloop_lag_p99_seconds gauge 99th percentile of recorded event loop delays

Heap & GC

Metric Type Labels Description
betterdb_nodejs_heap_size_total_bytes gauge - Process heap size from Node.js in bytes
betterdb_nodejs_heap_size_used_bytes gauge - Process heap size used from Node.js in bytes
betterdb_nodejs_external_memory_bytes gauge - Node.js external memory size in bytes
betterdb_nodejs_heap_space_size_total_bytes gauge space Process heap space size total in bytes
betterdb_nodejs_heap_space_size_used_bytes gauge space Process heap space size used in bytes
betterdb_nodejs_heap_space_size_available_bytes gauge space Process heap space size available in bytes
betterdb_nodejs_gc_duration_seconds histogram kind Garbage collection duration (major, minor, incremental, weakcb)

Scrape Configuration

Basic Prometheus Configuration

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets: ['localhost:3001']
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

Multi-Instance Setup

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets:
        - 'betterdb-prod-1:3001'
        - 'betterdb-prod-2:3001'
        - 'betterdb-staging:3001'
        labels:
          env: 'production'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s

With Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'betterdb'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: betterdb-monitor
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: '${1}:3001'
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s

Useful PromQL Queries

Anomaly Detection

# Anomaly detection rate (events per minute)
rate(betterdb_anomaly_events_total[5m]) * 60

# Critical anomalies in last hour
betterdb_anomaly_by_severity{severity="critical"}

# Detection system readiness percentage
sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready) * 100

# Memory pressure incidents in last hour
increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[1h])

# Top metrics causing anomalies
topk(5, betterdb_anomaly_by_metric)

# Unresolved critical anomalies
betterdb_anomaly_events_current{severity="critical"}

Memory & Performance

# Memory usage percentage (if maxmemory is set)
(betterdb_memory_used_bytes / betterdb_memory_max_bytes) * 100

# Memory fragmentation ratio (alert if > 1.5)
betterdb_memory_fragmentation_ratio

# Cache hit rate percentage
(betterdb_keyspace_hits_total / (betterdb_keyspace_hits_total + betterdb_keyspace_misses_total)) * 100

# Operations per second trend
rate(betterdb_commands_processed_total[5m])

# Network throughput (combined input + output)
betterdb_instantaneous_input_kbps + betterdb_instantaneous_output_kbps

Client Analytics

# Connection growth rate
rate(betterdb_connections_received_total[5m])

# Current connection count by user
sum by (user) (betterdb_client_connections_by_user)

# Peak vs current connections
betterdb_client_connections_peak - betterdb_client_connections_current

Slowlog Analysis

# Top 5 slow query patterns
topk(5, betterdb_slowlog_pattern_count)

# Slowest query patterns by average duration
topk(5, betterdb_slowlog_pattern_avg_duration_us)

# Slowlog growth rate
rate(betterdb_slowlog_length[5m])

Cluster Health

# Cluster slot health percentage
(betterdb_cluster_slots_ok / betterdb_cluster_slots_assigned) * 100

# Failed slots alert
betterdb_cluster_slots_fail + betterdb_cluster_slots_pfail

# Replication lag (for replicas)
betterdb_master_last_io_seconds_ago

Application Health

# BetterDB Monitor event loop lag (alert if > 100ms)
betterdb_nodejs_eventloop_lag_p99_seconds > 0.1

# Poll duration 99th percentile
histogram_quantile(0.99, rate(betterdb_poll_duration_seconds_bucket[5m]))

# High cardinality metric check (client names)
count(betterdb_client_connections_by_name)

Alertmanager Rules

The following alert rules are production-ready. See docs/alertmanager-rules.yml for the complete YAML configuration.

Critical Alerts

BetterDBCriticalAnomaly - Fires immediately when a critical anomaly is detected

increase(betterdb_anomaly_events_total{severity="critical"}[5m]) > 0

BetterDBMemoryPressure - Memory pressure pattern detected

increase(betterdb_correlated_groups_total{pattern="memory_pressure"}[10m]) > 0

BetterDBAuthAnomaly - Potential authentication attack

increase(betterdb_correlated_groups_total{pattern="auth_attack"}[5m]) > 0

Warning Alerts

BetterDBWarningSpike - Multiple warning anomalies in short period

increase(betterdb_anomaly_events_total{severity="warning"}[5m]) > 5

BetterDBConnectionLeak - Possible connection leak pattern

increase(betterdb_correlated_groups_total{pattern="connection_leak"}[10m]) > 0
for: 5m

BetterDBTrafficBurst - Traffic burst detected

increase(betterdb_correlated_groups_total{pattern="traffic_burst"}[5m]) > 0

BetterDBUnresolvedCriticalAnomalies - Multiple unresolved critical anomalies

betterdb_anomaly_events_current{severity="critical"} > 3
for: 10m

BetterDBPersistentAnomalies - Persistent anomalies over time

betterdb_anomaly_by_severity{severity!="info"} > 10
for: 30m

Info Alerts

BetterDBAnomalyDetectionWarming - Anomaly detection system warming up

(sum(betterdb_anomaly_buffer_ready) / count(betterdb_anomaly_buffer_ready)) < 1
for: 5m

Grafana Integration

Import Ready-Made Dashboard

  1. Navigate to Grafana → Dashboards → Import
  2. Use BetterDB Monitor dashboard ID: [Coming Soon]
  3. Select your Prometheus datasource
  4. Click Import

Creating Custom Dashboards

Recommended Panels:

  1. Anomaly Overview - Gauge showing unresolved critical anomalies
  2. Anomaly Timeline - Graph of rate(betterdb_anomaly_events_total[5m]) by severity
  3. Pattern Detection - Bar chart of betterdb_correlated_groups_by_pattern
  4. Memory Usage - Graph showing betterdb_memory_used_bytes vs betterdb_memory_max_bytes
  5. Cache Hit Rate - Graph showing cache hit rate percentage
  6. Connection Trends - Graph of betterdb_client_connections_current and peak
  7. Slow Query Patterns - Table showing top patterns from betterdb_slowlog_pattern_*
  8. Buffer Readiness - Heatmap of betterdb_anomaly_buffer_ready by metric type

Example Panel Query (Memory Usage)

{
  "expr": "betterdb_memory_used_bytes",
  "legendFormat": "Used Memory",
  "refId": "A"
},
{
  "expr": "betterdb_memory_max_bytes",
  "legendFormat": "Max Memory",
  "refId": "B"
}

Configuration

Metrics Update Interval

The anomaly detection Prometheus summary is updated every 30 seconds by default. Configure via:

ANOMALY_PROMETHEUS_INTERVAL_MS=30000

Or update at runtime via the /settings API endpoint:

curl -X PUT http://localhost:3001/settings \
  -H "Content-Type: application/json" \
  -d '{"anomalyPrometheusIntervalMs": 15000}'

Cardinality Management

High-cardinality labels can impact Prometheus performance. Monitor these metrics:

  • betterdb_client_connections_by_name - Scales with unique client names
  • betterdb_client_connections_by_user - Scales with unique usernames
  • betterdb_cluster_slot_* - Limited to top 100 slots automatically

If cardinality becomes an issue, consider:

  • Aggregating client names using relabel_configs in Prometheus
  • Filtering specific labels using metric_relabel_configs
  • Reducing retention period for client analytics data

Troubleshooting

Missing Metrics

COMMANDLOG metrics not appearing?

  • Check Valkey version: Requires Valkey 8.1+
  • Verify connection: Ensure BetterDB is connected to Valkey (not Redis)

Cluster slot metrics not appearing?

  • Check Valkey version: Requires Valkey 8.0+
  • Verify cluster mode: Ensure the instance is in cluster mode

Anomaly metrics showing zeros?

  • Wait for warmup: Anomaly detection requires 30 samples (30 seconds at 1s poll rate)
  • Check buffer readiness: Query betterdb_anomaly_buffer_ready

High Scrape Duration

If /prometheus/metrics takes >1s to respond:

  • Reduce slowlog analysis sample size (default: 128 entries)
  • Reduce cluster slot stats limit (default: 100 slots)
  • Increase scrape timeout in Prometheus config
  • Check if database is responding slowly

Stale Metrics

If metrics appear outdated:

  • Verify BetterDB Monitor is running: Check betterdb_process_start_time_seconds
  • Check database connectivity: Review /health endpoint
  • Verify polling services: Check betterdb_polls_total is incrementing