Anomaly Detection Guide
Comprehensive guide to BetterDB’s real-time anomaly detection system for Valkey and Redis databases.
Table of Contents
- Overview
- How It Works
- Detected Patterns
- Severity Levels
- Monitored Metrics
- Configuration
- API Endpoints
- Tuning Guide
- Integration with Alerting
- Troubleshooting
Overview
BetterDB’s anomaly detection system continuously monitors your Valkey/Redis instance for unusual behavior patterns. It uses statistical analysis to establish baselines and detect deviations that could indicate problems before they impact your application.
Why Anomaly Detection Matters
Traditional monitoring relies on static thresholds (“alert if memory > 80%”), but these fail to catch:
- Gradual degradation - Slow memory leaks that don’t cross thresholds
- Unusual patterns - Connection spikes that are abnormal for your baseline
- Correlated issues - Multiple metrics changing together (memory + evictions + fragmentation)
- Attack patterns - Authentication failures that spike beyond normal rates
BetterDB’s detection adapts to your normal behavior and alerts when something deviates significantly.
Key Benefits
- Automatic baselining - No manual threshold configuration
- Pattern correlation - Identifies root causes by linking related anomalies
- Actionable recommendations - Each pattern includes specific remediation steps
- Low false positives - Requires multiple consecutive samples to confirm anomalies
- Prometheus integration - Export metrics for alerting and dashboards
How It Works
Architecture Flow
┌─────────────────┐
│ Valkey/Redis │
│ INFO │
└────────┬────────┘
│ Poll (1s intervals)
▼
┌─────────────────┐
│ Metric Extractor│ ← Extracts 11 key metrics
└────────┬────────┘
│
▼
┌─────────────────┐
│ Circular Buffer │ ← Stores last 300 samples (5 min)
│ (Per Metric) │ Calculates mean & stddev
└────────┬────────┘
│
▼
┌─────────────────┐
│ Spike Detector │ ← Z-score analysis
│ (Per Metric) │ Severity classification
└────────┬────────┘
│ Anomaly Events
▼
┌─────────────────┐
│ Correlator │ ← Groups related anomalies
│ │ Identifies patterns
└────────┬────────┘
│
▼
┌─────────────────┐
│ Pattern Groups │ ← Diagnosis + Recommendations
│ + Prometheus │
└─────────────────┘
1. Baseline Collection
Each monitored metric maintains a circular buffer of 300 samples (5 minutes at 1-second intervals):
- Minimum samples: 30 (warmup period)
- Rolling window: Last 300 samples only
- Statistics: Continuously calculates mean (μ) and standard deviation (σ)
Example: For the connections metric, if your baseline is 250 ± 25 connections, the buffer tracks this automatically.
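Conceptually, each buffer behaves like the sketch below (class and method names are illustrative, not BetterDB's actual source):

// Minimal sketch of a per-metric rolling buffer; names are illustrative.
class RollingBaseline {
  private samples: number[] = [];

  constructor(
    private readonly maxSamples = 300, // ~5 minutes at 1-second polling
    private readonly minSamples = 30,  // warmup before detection starts
  ) {}

  add(value: number): void {
    this.samples.push(value);
    if (this.samples.length > this.maxSamples) this.samples.shift(); // keep only the last maxSamples
  }

  get isReady(): boolean {
    return this.samples.length >= this.minSamples;
  }

  get mean(): number {
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }

  get stdDev(): number {
    const m = this.mean;
    return Math.sqrt(this.samples.reduce((acc, v) => acc + (v - m) ** 2, 0) / this.samples.length);
  }
}

const connections = new RollingBaseline();
connections.add(250); // called once per poll; mean/stdDev update as samples roll in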
2. Z-Score Calculation
For each new sample, calculate how many standard deviations it is from the mean:
Z-score = (current_value - mean) / stddev
Interpretation:
- Z = 0: Value is exactly at the mean
- Z = 2: Value is 2 standard deviations above mean (unusual)
- Z = 3: Value is 3 standard deviations above mean (very unusual)
- Z = -2: Value is 2 standard deviations below mean (drop)
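Continuing the connections example (baseline 250 ± 25), a sample of 310 gives Z = (310 - 250) / 25 = 2.4, which is unusual but not yet critical. A minimal helper, shown only for illustration:

// Z-score of a new sample against the rolling baseline. The zero-stddev guard
// is an assumption about how a perfectly flat baseline might be handled.
const zScore = (value: number, mean: number, stdDev: number): number =>
  stdDev === 0 ? 0 : (value - mean) / stdDev;

console.log(zScore(310, 250, 25)); // 2.4 -> unusual
console.log(zScore(450, 250, 25)); // 8.0 -> far outside normal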
3. Spike/Drop Detection
An anomaly is triggered when:
- Z-score exceeds threshold (e.g., Z > 2.0 for warning), OR an absolute threshold is exceeded (e.g., ACL denied > 50 for critical)
- Consecutive samples required (default: 3 consecutive to reduce noise)
- Cooldown period respected (default: 60s between alerts for same metric)
Severity determination:
- WARNING: Z ≥ 2.0 (or warning threshold)
- CRITICAL: Z ≥ 3.0 (or critical threshold)
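Putting the three conditions together, the decision logic looks roughly like the sketch below. Defaults come from the list above; everything else (class shape, symmetric handling of drops via the absolute Z-score) is an assumption, not BetterDB's actual detector:

// Illustrative spike/drop check: fires only after N consecutive out-of-range
// samples, and never while the per-metric cooldown is still active.
type Severity = 'warning' | 'critical' | null;

class SpikeDetector {
  private consecutive = 0;
  private lastAlertAt = 0;

  constructor(
    private readonly warningZ = 2.0,
    private readonly criticalZ = 3.0,
    private readonly consecutiveRequired = 3,
    private readonly cooldownMs = 60_000,
  ) {}

  check(z: number, now: number): Severity {
    if (Math.abs(z) < this.warningZ) {
      this.consecutive = 0; // back inside the normal range: reset the streak
      return null;
    }
    this.consecutive++;
    const inCooldown = now - this.lastAlertAt < this.cooldownMs;
    if (this.consecutive < this.consecutiveRequired || inCooldown) return null;
    this.lastAlertAt = now;
    return Math.abs(z) >= this.criticalZ ? 'critical' : 'warning';
  }
}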
4. Correlation of Related Anomalies
Every 5 seconds, the correlator examines recent anomalies and groups them by:
- Time proximity - Events within 5-second window
- Pattern matching - Specific combinations of metrics
Example: If connections, ops_per_sec, and memory_used all spike together within 5 seconds, this correlates to a BATCH_JOB pattern.
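A simplified version of the time-proximity grouping (names are illustrative; anomalies that fall inside the 5-second window opened by a group's first event share that group):

// Rough sketch of grouping anomaly events by time proximity.
interface AnomalyEvent { metricType: string; timestamp: number; severity: string; }

function groupByTimeProximity(events: AnomalyEvent[], windowMs = 5_000): AnomalyEvent[][] {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const groups: AnomalyEvent[][] = [];
  for (const ev of sorted) {
    const current = groups[groups.length - 1];
    if (current && ev.timestamp - current[0].timestamp <= windowMs) {
      current.push(ev); // inside the 5-second window of the current group
    } else {
      groups.push([ev]); // gap larger than the window: start a new group
    }
  }
  return groups;
}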
5. Pattern Diagnosis
Each pattern includes:
- Diagnosis - What the pattern means operationally
- Recommendations - Specific actions to investigate or remediate
- Severity - Inherited from highest severity anomaly in the group
Detected Patterns
AUTH_ATTACK
Triggers: Spike in acl_denied metric
What it means: Elevated ACL denial rate, possibly indicating:
- Brute force authentication attempt
- Misconfigured client credentials
- Expired or revoked ACL permissions
Recommended actions:
- Review ACL denied clients in the audit trail
- Check for suspicious IP addresses or patterns
- Consider implementing rate limiting or IP blocking
- Verify ACL rules are configured correctly
Example scenario: A client repeatedly tries wrong passwords, causing 50+ ACL denials in 10 seconds.
SLOW_QUERIES
Triggers: Spike in slowlog_count or blocked_clients metrics
What it means: Unusual number of slow queries, indicating:
- Operations on large data structures
- Blocking operations (BLPOP, BRPOP)
- Inefficient command patterns
- Potential deadlocks
Recommended actions:
- Review slow log entries to identify problematic commands
- Check for operations on large data structures
- Consider optimizing data access patterns
- Monitor blocked clients for potential deadlocks
Example scenario: Application starts scanning large hash keys, causing slowlog to grow from 10 to 100 entries in 30 seconds.
MEMORY_PRESSURE
Triggers:
- Spike in memory_used (required)
- Optionally accompanied by spikes in evicted_keys or fragmentation_ratio
What it means: Memory consumption elevated beyond normal, possibly due to:
- Large data import or bulk write
- Memory leak in application
- Lack of TTLs on new keys
- Insufficient maxmemory configuration
Recommended actions:
- Check memory usage trends and plan for scaling
- Review eviction policy settings
- Identify large keys or data structures
- Consider increasing maxmemory or adding shards
Example scenario: Memory usage jumps from 2GB to 3.5GB while evictions increase from 0 to 500/sec, and fragmentation rises to 1.8.
CACHE_THRASHING
Triggers: Concurrent spikes in keyspace_misses and evicted_keys
What it means: Cache is thrashing - keys are being evicted and immediately requested again:
- Working set exceeds available memory
- Poor cache hit rate due to eviction pressure
- Suboptimal eviction policy for workload
Recommended actions:
- Review cache hit ratio trends
- Check if working set exceeds available memory
- Consider increasing memory or adjusting TTLs
- Analyze access patterns for optimization opportunities
Example scenario: After deploying new feature, cache misses jump from 5% to 40% while evictions increase 10x.
CONNECTION_LEAK
Triggers:
- Spike in connections WITHOUT a corresponding spike in ops_per_sec
What it means: Connections are accumulating without proportional traffic:
- Connection pool leak in application
- Clients not closing connections properly
- Long-lived idle connections accumulating
- Connection creation faster than cleanup
Recommended actions:
- Check for idle connections in client analytics
- Review client applications for connection pool leaks
- Set timeout parameters (timeout, tcp-keepalive)
- Monitor connection creation vs. closure rates
Example scenario: Connection count rises from 200 to 800 over 10 minutes, but ops/sec remains constant at 2000.
BATCH_JOB
Triggers: Concurrent spikes in connections, ops_per_sec, AND memory_used
What it means: Large-scale batch operation is running:
- Bulk data import job
- Backup or export process
- Migration script running
- Scheduled data processing
Recommended actions:
- Identify the client or job causing the spike
- Consider scheduling batch jobs during off-peak hours
- Implement rate limiting for bulk operations
- Monitor job duration and resource usage
Example scenario: Nightly ETL job starts, causing connections to spike from 100 to 500, ops/sec from 1000 to 15000, and memory from 1GB to 2GB.
TRAFFIC_BURST
Triggers:
- Spikes in connections and ops_per_sec WITHOUT a memory spike
- OR spikes in input_kbps / output_kbps
What it means: Sudden increase in legitimate traffic:
- Application feature launch
- Traffic surge (viral content, marketing campaign)
- Retry storm from upstream service
- Recurring traffic pattern (daily peak)
Recommended actions:
- Monitor traffic patterns for recurring spikes
- Ensure sufficient capacity for peak loads
- Review client connection pooling settings
- Consider implementing auto-scaling if cloud-hosted
Example scenario: Marketing campaign launches, ops/sec increases from 2000 to 12000, but memory usage remains stable.
UNKNOWN
Triggers: Anomalies that don’t match any defined pattern
What it means: Unusual behavior detected but correlation unclear:
- Novel issue not covered by patterns
- Single metric anomaly
- Metrics changed but pattern incomplete
Recommended actions:
- Investigate the specific metric trend
- Check for related system events
- Review application behavior during this time
- Correlate with external monitoring data
Example scenario: fragmentation_ratio spikes to 2.5 with no other metrics affected.
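For reference, the trigger combinations above can be summarized as a lookup table. The shape below is illustrative only; BetterDB's internal pattern checks may apply additional logic:

// Pattern -> metric combinations, summarizing the trigger descriptions above.
interface PatternRule {
  required?: string[];     // all of these must have spiked
  anyOf?: string[];        // at least one of these must have spiked
  optional?: string[];     // may accompany the required metrics
  mustBeAbsent?: string[]; // must NOT have spiked at the same time
}

const PATTERN_RULES: Record<string, PatternRule> = {
  AUTH_ATTACK:     { anyOf: ['acl_denied'] },
  SLOW_QUERIES:    { anyOf: ['slowlog_count', 'blocked_clients'] },
  MEMORY_PRESSURE: { required: ['memory_used'], optional: ['evicted_keys', 'fragmentation_ratio'] },
  CACHE_THRASHING: { required: ['keyspace_misses', 'evicted_keys'] },
  CONNECTION_LEAK: { required: ['connections'], mustBeAbsent: ['ops_per_sec'] },
  BATCH_JOB:       { required: ['connections', 'ops_per_sec', 'memory_used'] },
  // TRAFFIC_BURST also matches standalone input_kbps / output_kbps spikes (see above).
  TRAFFIC_BURST:   { required: ['connections', 'ops_per_sec'], mustBeAbsent: ['memory_used'] },
  // Anything that matches no rule above is reported as UNKNOWN.
};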
Severity Levels
INFO
Z-score range: Not currently used (reserved for future patterns)
Characteristics:
- Informational only
- No immediate action required
- Track over time for trends
Example: Minor fluctuation within expected variance
WARNING
Z-score threshold: ≥ 2.0 (or metric-specific warning threshold)
Characteristics:
- Noticeable deviation from baseline
- Should be investigated during business hours
- May indicate developing issue
- Typically requires 2-3 consecutive samples
Example: Connection count is 2.2 standard deviations above normal (Z=2.2)
Prometheus alert: Fire after 5 warnings in 5 minutes
CRITICAL
Z-score threshold: ≥ 3.0 (or metric-specific critical threshold)
Characteristics:
- Significant deviation from baseline
- Immediate investigation recommended
- Likely indicates active problem
- May require emergency response
Example: ACL denials are 3.5 standard deviations above normal (Z=3.5), suggesting authentication attack
Prometheus alert: Fire immediately on first critical event
Per-Metric Thresholds
Some metrics have custom thresholds beyond Z-score:
| Metric | Warning Threshold | Critical Threshold | Consecutive Required | Cooldown |
|---|---|---|---|---|
| acl_denied | 10 events | 50 events | 2 | 30s |
| slowlog_count | - | - | 2 | 30s |
| memory_used | - | - | 3 | 60s |
| evicted_keys | - | - | 2 | 30s |
| fragmentation_ratio | 1.5 | 2.0 | 5 | 120s |
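Expressed as code, these overrides look roughly like the following sketch. The field names mirror the tuning examples later in this guide; the enum members and exact shape are assumptions, not BetterDB's actual source:

// Sketch of the per-metric overrides from the table above (illustrative).
enum MetricType {
  ACL_DENIED = 'acl_denied',
  SLOWLOG_COUNT = 'slowlog_count',
  MEMORY_USED = 'memory_used',
  EVICTED_KEYS = 'evicted_keys',
  FRAGMENTATION_RATIO = 'fragmentation_ratio',
}

const DETECTOR_OVERRIDES: Partial<Record<MetricType, {
  warningThreshold?: number;   // absolute value, checked alongside the Z-score
  criticalThreshold?: number;
  consecutiveRequired: number;
  cooldownMs: number;
}>> = {
  [MetricType.ACL_DENIED]:          { warningThreshold: 10, criticalThreshold: 50, consecutiveRequired: 2, cooldownMs: 30_000 },
  [MetricType.SLOWLOG_COUNT]:       { consecutiveRequired: 2, cooldownMs: 30_000 },
  [MetricType.MEMORY_USED]:         { consecutiveRequired: 3, cooldownMs: 60_000 },
  [MetricType.EVICTED_KEYS]:        { consecutiveRequired: 2, cooldownMs: 30_000 },
  [MetricType.FRAGMENTATION_RATIO]: { warningThreshold: 1.5, criticalThreshold: 2.0, consecutiveRequired: 5, cooldownMs: 120_000 },
};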
Monitored Metrics
connections
- What it measures: Current number of client connections
- Why anomalies matter: Sudden spikes may indicate connection leaks, DDoS, or batch jobs
- Typical baseline: Varies by workload (50-5000)
- Source: INFO clients.connected_clients
ops_per_sec
- What it measures: Instantaneous operations per second
- Why anomalies matter: Indicates traffic changes, application issues, or attacks
- Typical baseline: Varies widely (100-100000+)
- Source: INFO stats.instantaneous_ops_per_sec
memory_used
- What it measures: Total allocated memory in bytes
- Why anomalies matter: Sudden increases may indicate memory leaks or data bloat
- Typical baseline: Depends on dataset size (100MB-100GB+)
- Source: INFO memory.used_memory
- Config: Custom thresholds (3 consecutive, 60s cooldown)
input_kbps
- What it measures: Current input kilobytes per second
- Why anomalies matter: Large write operations or bulk imports
- Typical baseline: Varies by write load (1-10000 kbps)
- Source: INFO stats.instantaneous_input_kbps
output_kbps
- What it measures: Current output kilobytes per second
- Why anomalies matter: Large read operations or data exports
- Typical baseline: Varies by read load (1-10000 kbps)
- Source: INFO stats.instantaneous_output_kbps
slowlog_count
- What it measures: Current length of SLOWLOG
- Why anomalies matter: Indicates query performance degradation
- Typical baseline: 0-50 (depends on threshold configuration)
- Source: SLOWLOG LEN
- Config: Custom thresholds (2 consecutive, 30s cooldown)
acl_denied
- What it measures: Sum of rejected connections and ACL auth denials
- Why anomalies matter: Security concern - possible brute force or misconfiguration
- Typical baseline: 0-5 (should be very low normally)
- Source: INFO stats.rejected_connections + stats.acl_access_denied_auth
- Config: Custom thresholds (WARNING: 10, CRITICAL: 50, 2 consecutive, 30s cooldown)
evicted_keys
- What it measures: Total number of keys evicted due to maxmemory
- Why anomalies matter: Indicates memory pressure and cache thrashing
- Typical baseline: 0 (ideally), or a consistent low rate
- Source: INFO stats.evicted_keys
- Config: Custom thresholds (2 consecutive, 30s cooldown)
blocked_clients
- What it measures: Clients blocked on BLPOP, BRPOP, etc.
- Why anomalies matter: May indicate queue backup or deadlock
- Typical baseline: 0-10 (depends on usage of blocking commands)
- Source: INFO clients.blocked_clients
keyspace_misses
- What it measures: Total number of failed key lookups
- Why anomalies matter: Poor cache hit rate impacts application performance
- Typical baseline: Varies widely (track hit ratio instead)
- Source: INFO stats.keyspace_misses
fragmentation_ratio
- What it measures: mem_fragmentation_ratio from INFO
- Why anomalies matter: High fragmentation wastes memory and impacts performance
- Typical baseline: 1.0-1.3 (ideal)
- Source: INFO memory.mem_fragmentation_ratio
- Config: Custom thresholds (WARNING: 1.5, CRITICAL: 2.0, 5 consecutive, 120s cooldown)
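To make the Source entries above concrete, here is a rough sketch of how these values could be pulled out of a raw INFO response (INFO returns "key:value" lines; the parsing function itself is illustrative, not BetterDB's extractor):

// Extract a few of the metrics above from a raw INFO string.
function extractMetrics(rawInfo: string): Record<string, number> {
  const fields = new Map<string, string>();
  for (const line of rawInfo.split('\r\n')) {
    if (!line || line.startsWith('#')) continue; // skip blank lines and section headers
    const idx = line.indexOf(':');
    if (idx > 0) fields.set(line.slice(0, idx), line.slice(idx + 1));
  }
  const num = (key: string) => Number(fields.get(key) ?? 0);
  return {
    connections: num('connected_clients'),
    ops_per_sec: num('instantaneous_ops_per_sec'),
    memory_used: num('used_memory'),
    evicted_keys: num('evicted_keys'),
    blocked_clients: num('blocked_clients'),
    keyspace_misses: num('keyspace_misses'),
    fragmentation_ratio: num('mem_fragmentation_ratio'),
    // acl_denied combines the two counters listed above
    acl_denied: num('rejected_connections') + num('acl_access_denied_auth'),
  };
}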
Configuration
Environment Variables
Set these before starting BetterDB:
# Enable/disable anomaly detection (default: true)
ANOMALY_DETECTION_ENABLED=true
# Polling interval in milliseconds (default: 1000)
ANOMALY_POLL_INTERVAL_MS=1000
# Cache TTL for in-memory anomaly data (default: 3600000 = 1 hour)
ANOMALY_CACHE_TTL_MS=3600000
# Prometheus metrics update interval (default: 30000 = 30 seconds)
ANOMALY_PROMETHEUS_INTERVAL_MS=30000
Runtime Settings
You can adjust these settings without restarting via the /settings API:
curl -X PUT http://localhost:3001/settings \
-H "Content-Type: application/json" \
-d '{
"anomalyPollIntervalMs": 500,
"anomalyCacheTtlMs": 7200000,
"anomalyPrometheusIntervalMs": 15000
}'
Note: Changing anomalyPollIntervalMs affects detection sensitivity. Faster polling = quicker detection but higher overhead.
Disabling Detection
To completely disable anomaly detection:
# In .env or environment
ANOMALY_DETECTION_ENABLED=false
Or set at container runtime:
docker run -e ANOMALY_DETECTION_ENABLED=false betterdb/monitor
API Endpoints
Get Recent Anomaly Events
GET /api/anomaly/events?limit=100&severity=critical&metricType=connections
Query Parameters:
- startTime (optional): Unix timestamp in milliseconds
- endTime (optional): Unix timestamp in milliseconds
- severity (optional): info, warning, or critical
- metricType (optional): Filter by specific metric
- limit (optional): Max events to return (default: 100)
Response:
{
"events": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"timestamp": 1704067200000,
"metricType": "connections",
"anomalyType": "spike",
"severity": "warning",
"value": 450,
"baseline": 250,
"stdDev": 25,
"zScore": 8.0,
"threshold": 2.0,
"message": "WARNING: connections spike detected. Value: 450, Baseline: 250 (80.0% above normal, Z-score: 8.00)",
"correlationId": "760e9500-f30c-52e5-b827-557766551111",
"relatedMetrics": ["ops_per_sec"],
"resolved": false
}
]
}
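For scripting against this endpoint, a small fetch-based example (it assumes BetterDB is reachable on localhost:3001, as in the other examples, and Node 18+ or a browser for the global fetch):

// Pull the 20 most recent critical anomalies from the events endpoint above.
async function fetchCriticalEvents(): Promise<void> {
  const params = new URLSearchParams({ severity: 'critical', limit: '20' });
  const res = await fetch(`http://localhost:3001/api/anomaly/events?${params}`);
  if (!res.ok) throw new Error(`BetterDB returned HTTP ${res.status}`);
  const { events } = (await res.json()) as {
    events: Array<{ timestamp: number; metricType: string; message: string }>;
  };
  for (const ev of events) {
    console.log(new Date(ev.timestamp).toISOString(), ev.metricType, ev.message);
  }
}

fetchCriticalEvents().catch(console.error);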
Get Correlated Anomaly Groups
GET /api/anomaly/groups?limit=50&pattern=memory_pressure
Query Parameters:
- startTime (optional): Unix timestamp in milliseconds
- endTime (optional): Unix timestamp in milliseconds
- pattern (optional): Filter by pattern name
- limit (optional): Max groups to return (default: 50)
Response:
{
"groups": [
{
"correlationId": "760e9500-f30c-52e5-b827-557766551111",
"timestamp": 1704067200000,
"pattern": "memory_pressure",
"severity": "critical",
"diagnosis": "Memory pressure detected with potential evictions",
"recommendations": [
"Check memory usage trends and plan for scaling",
"Review eviction policy settings",
"Identify large keys or data structures",
"Consider increasing maxmemory or adding shards"
],
"anomalies": [
{ "metricType": "memory_used", "severity": "critical", "zScore": 3.2 },
{ "metricType": "evicted_keys", "severity": "warning", "zScore": 2.5 }
]
}
]
}
Get Anomaly Summary
GET /api/anomaly/summary?startTime=1704067200000
Response:
{
"totalEvents": 42,
"totalGroups": 8,
"activeEvents": 3,
"resolvedEvents": 39,
"bySeverity": {
"info": 0,
"warning": 35,
"critical": 7
},
"byMetric": {
"connections": 12,
"memory_used": 8,
"ops_per_sec": 15,
"acl_denied": 7
},
"byPattern": {
"traffic_burst": 3,
"memory_pressure": 2,
"auth_attack": 1,
"unknown": 2
}
}
Get Buffer Statistics
GET /api/anomaly/buffers
Response:
{
"buffers": [
{
"metricType": "connections",
"sampleCount": 300,
"mean": 250.5,
"stdDev": 25.3,
"min": 180,
"max": 320,
"latest": 255,
"isReady": true
}
]
}
Resolve Anomaly or Group
POST /api/anomaly/resolve
Content-Type: application/json
{
"anomalyId": "550e8400-e29b-41d4-a716-446655440000"
}
Or resolve entire group:
POST /api/anomaly/resolve-group
Content-Type: application/json
{
"correlationId": "760e9500-f30c-52e5-b827-557766551111"
}
Clear Resolved Anomalies
DELETE /api/anomaly/resolved
Response:
{
"cleared": 39
}
Tuning Guide
Reducing False Positives
Problem: Too many warning alerts for normal variance
Solutions:
- Increase Z-score thresholds (requires code change, or wait for configurable detectors)
- Increase consecutive required samples (edit detector config in anomaly.service.ts)
- Lengthen cooldown periods (prevents repeat alerts)
- Increase poll interval - less frequent sampling reduces noise
Example: Edit anomaly.service.ts configs:
[MetricType.CONNECTIONS]: {
warningZScore: 2.5, // Increase from 2.0
consecutiveRequired: 5, // Increase from 3
cooldownMs: 120000, // Increase from 60000
}
Increasing Sensitivity
Problem: Missing real issues because thresholds are too high
Solutions:
- Decrease Z-score thresholds (e.g., 1.5 for warning instead of 2.0)
- Reduce consecutive required samples (alert faster)
- Shorten cooldown periods (allow more frequent alerts)
- Add absolute thresholds for critical metrics
Example: Add absolute threshold for connections:
[MetricType.CONNECTIONS]: {
warningZScore: 1.8,
criticalThreshold: 1000, // Alert if > 1000 regardless of Z-score
}
Baseline Warmup Issues
Problem: Detection not working immediately after startup
Solution: Wait for warmup period
- Minimum: 30 samples = 30 seconds (at 1s poll interval)
- Optimal: 300 samples = 5 minutes (full buffer)
Check buffer readiness:
curl http://localhost:3001/api/anomaly/buffers | jq '.buffers[] | select(.isReady == false)'
Or via Prometheus:
betterdb_anomaly_buffer_ready == 0
Pattern Detection Not Working
Problem: Anomalies detected but not correlated into patterns
Debugging:
- Check correlation window (default: 5 seconds) - Anomalies must occur within this window
- Verify pattern requirements - Some patterns need specific metric combinations
- Review custom pattern check functions - they may have additional logic
Example: BATCH_JOB requires connections AND ops_per_sec AND memory_used to ALL spike within 5 seconds.
High Memory Usage
Problem: Anomaly detection using too much memory
Solutions:
- Reduce buffer size (default: 300 samples per metric × 11 metrics = 3300 samples)
- Reduce cache TTL (ANOMALY_CACHE_TTL_MS) - older events are purged sooner
- Reduce max recent events (default: 1000 events, 100 groups)
Example: Edit metric-buffer.ts:
constructor(
private readonly metricType: MetricType,
maxSamples: number = 150, // Reduce from 300 (2.5 min instead of 5 min)
minSamples: number = 20, // Reduce from 30
)
Integration with Alerting
Prometheus + Alertmanager
Step 1: Configure Prometheus to scrape BetterDB
scrape_configs:
- job_name: 'betterdb'
static_configs:
- targets: ['betterdb:3001']
metrics_path: '/prometheus/metrics'
scrape_interval: 15s
Step 2: Add alert rules (see docs/alertmanager-rules.yml)
groups:
- name: betterdb-anomaly-alerts
rules:
- alert: BetterDBCriticalAnomaly
expr: increase(betterdb_anomaly_events_total{severity="critical"}[5m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Critical anomaly detected"
Step 3: Configure Alertmanager routing
route:
receiver: 'default'
routes:
- match:
alertname: BetterDBCriticalAnomaly
receiver: 'pagerduty-critical'
continue: false
- match:
severity: warning
receiver: 'slack-warnings'
PagerDuty Integration
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<your-service-key>'
        # Template values below are examples - adjust to your alert labels/annotations
        description: '{{ .CommonAnnotations.summary }}'
        details:
          pattern: '{{ .CommonLabels.pattern }}'
          metric_type: '{{ .CommonLabels.metric_type }}'
          instance: '{{ .CommonLabels.instance }}'
Slack Integration
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<your-webhook-url>'
        channel: '#redis-alerts'
        # Template values below are examples - adjust to your alert labels/annotations
        title: 'BetterDB Anomaly: {{ .CommonLabels.alertname }}'
        text: >-
          *Pattern*: {{ .CommonLabels.pattern }}
          *Metric*: {{ .CommonLabels.metric_type }}
          *Severity*: {{ .CommonLabels.severity }}
Custom Webhooks
Use Alertmanager’s webhook receiver:
receivers:
- name: 'custom-webhook'
webhook_configs:
- url: 'https://your-system.com/webhooks/betterdb'
send_resolved: true
Webhook payload includes:
- Alert name and labels (severity, pattern, metric_type)
- Annotations (summary, description)
- Alert state (firing/resolved)
- Timestamps
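A minimal receiver for that payload could look like the sketch below, using Node's built-in http module. The top-level fields (status, alerts, labels) follow Alertmanager's standard webhook format; the pattern and metric_type labels are those listed above, and the port and path are placeholders:

// Minimal webhook receiver for Alertmanager's webhook payload (illustrative).
import { createServer } from 'node:http';

createServer((req, res) => {
  if (req.method !== 'POST' || req.url !== '/webhooks/betterdb') {
    res.writeHead(404).end();
    return;
  }
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    const payload = JSON.parse(body);
    for (const alert of payload.alerts ?? []) {
      console.log(
        `[${payload.status}]`,                      // firing or resolved
        alert.labels?.alertname,
        `severity=${alert.labels?.severity}`,
        `pattern=${alert.labels?.pattern ?? 'n/a'}`,
      );
    }
    res.writeHead(200).end('ok');
  });
}).listen(8080); // point the webhook_configs url at this host and port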
Grafana Alerts (Alternative)
Create alerts directly in Grafana:
- Navigate to Alerting → Alert Rules → New Alert Rule
- Set query: increase(betterdb_anomaly_events_total{severity="critical"}[5m]) > 0
- Configure notification channel (Slack, PagerDuty, Email)
- Set evaluation interval: 1m
Troubleshooting
No Anomalies Being Detected
Check:
- Buffer readiness: GET /api/anomaly/buffers - all buffers should show isReady: true
- Polling active: check that the Prometheus counter betterdb_polls_total is incrementing
- Database connectivity: GET /health should show the database as healthy
- Actual variance: your workload may be very stable (low stddev)
Solution: Artificially create load to test:
# Spike connections
redis-benchmark -h localhost -p 6379 -c 500 -n 100000
Anomalies Detected But Not Correlated
Check:
- Correlation interval: Events must occur within 5 seconds
- Pattern requirements: Some patterns need specific metric combinations
- Compare /api/anomaly/events vs /api/anomaly/groups
Solution: Look for events with correlationId: null - these haven’t been grouped yet.
Too Many False Positives
Check:
- Baseline stability: Very spiky workloads create high stddev
- Consecutive requirements: May be too low (default: 3)
- Cooldown periods: May be too short
Solution: See Tuning Guide above.
Prometheus Metrics Not Updating
Check:
- Prometheus summary interval: default 30s, configurable via ANOMALY_PROMETHEUS_INTERVAL_MS
- Check the /prometheus/metrics endpoint directly
- Verify Prometheus scrape config and target health
Solution:
# Check metrics directly
curl http://localhost:3001/prometheus/metrics | grep anomaly
# Check Prometheus targets
http://prometheus:9090/targets
High CPU Usage from Detection
Check:
- Poll interval: Default 1s may be too aggressive for slow networks
- Number of metrics: 11 metrics × polling + correlation overhead
Solution: Increase poll interval:
ANOMALY_POLL_INTERVAL_MS=2000 # Poll every 2 seconds instead of 1
Old Anomalies Not Being Cleared
Check:
- Cache TTL: default 1 hour (ANOMALY_CACHE_TTL_MS)
- Storage backend: PostgreSQL/SQLite retains data indefinitely
Solution: Manually clear resolved anomalies:
curl -X DELETE http://localhost:3001/api/anomaly/resolved
Or query historical data with time filters:
curl "http://localhost:3001/api/anomaly/events?startTime=$(date -d '1 hour ago' +%s)000"