Monitoring Guide
Observability for EdgeQuake Deployments
This guide covers monitoring, logging, and alerting for EdgeQuake in production environments.
Observability Stack
┌─────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY OVERVIEW                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │  EdgeQuake  │───▶│    Logs     │───▶│  Log Aggr.  │          │
│  │   Server    │    │  (stdout)   │    │ (Loki/ELK)  │          │
│  └──────┬──────┘    └─────────────┘    └─────────────┘          │
│         │                                                       │
│         ├─────────▶ /health endpoints                           │
│         │                                                       │
│         ├─────────▶ /metrics (planned)                          │
│         │                                                       │
│         └─────────▶ PostgreSQL metrics                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Health Endpoints
EdgeQuake provides built-in health endpoints:
| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Basic liveness | { "status": "ok" } |
| GET /health/ready | Readiness check | Database + LLM status |
| GET /health/live | Kubernetes liveness | Process check |
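A load balancer or deploy script typically polls these endpoints and acts on the result. The sketch below shows one way to interpret a readiness response in Python with only the standard library. It assumes the body has a top-level `status` field and a `checks` object mapping component names to `"ok"`, as in this guide's readiness example. The helper names `is_ready` and `fetch_readiness` are illustrative and not part of EdgeQuake.

```python
import json
import urllib.request


def is_ready(payload: dict) -> tuple[bool, list[str]]:
    """Return (ready, failing_checks) for a parsed /health/ready body.

    Assumes the shape {"status": "ready", "checks": {name: "ok", ...}}.
    """
    checks = payload.get("checks", {})
    failing = sorted(name for name, state in checks.items() if state != "ok")
    ready = payload.get("status") == "ready" and not failing
    return ready, failing


def fetch_readiness(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> dict:
    """Fetch and parse GET /health/ready (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Offline example; the "error" value is a made-up failure state.
sample = {"status": "ready", "checks": {"database": "ok", "llm_provider": "error"}}
print(is_ready(sample))  # (False, ['llm_provider'])
```

`urllib.request.urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so a caller polling a live server would likely treat that exception as "not ready" as well.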
Basic Health Check
    curl http://localhost:8080/health

    {
      "status": "ok",
      "version": "0.1.0",
      "storage_mode": "postgresql"
    }

Readiness Check

    curl http://localhost:8080/health/ready

    {
      "status": "ready",
      "checks": {
        "database": "ok",
        "llm_provider": "ok"
      }
    }

Logging
Log Format
EdgeQuake uses structured JSON logging via the tracing crate:

    {
      "timestamp": "2024-01-15T10:30:00.000Z",
      "level": "INFO",
      "target": "edgequake_api::handlers::documents",
      "message": "Document uploaded successfully",
      "fields": {
        "document_id": "doc_123",
        "workspace_id": "ws_456",
        "duration_ms": 1234
      }
    }

Log Levels
| Level | RUST_LOG Setting | Use Case |
|---|---|---|
| Error | error | Critical failures |
| Warn | warn | Degraded but working |
| Info | info | Production operations |
| Debug | debug | Development debugging |
| Trace | trace | Detailed tracing |
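When logs are read outside an aggregator (for example, piped from `docker logs`), the same severity ordering can be applied client-side. The following is a minimal sketch, assuming each log line is a JSON object with a `level` field as in the log format example. The sample lines and the `at_least` helper are invented for illustration.

```python
import json

# Severity order matching the RUST_LOG levels in the table above.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR"]


def at_least(lines, minimum="WARN"):
    """Yield parsed log records whose level is `minimum` or more severe.

    Lines that are not valid JSON objects are skipped rather than raising.
    """
    threshold = LEVELS.index(minimum.upper())
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        if level in LEVELS and LEVELS.index(level) >= threshold:
            yield record


# Made-up sample lines, not real EdgeQuake output.
sample_lines = [
    '{"level": "INFO", "target": "edgequake_api", "message": "Document uploaded successfully"}',
    '{"level": "ERROR", "target": "edgequake_query", "message": "LLM request failed"}',
    "not json",
]
for rec in at_least(sample_lines, "WARN"):
    print(rec["level"], rec["message"])  # ERROR LLM request failed
```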
Recommended Production Settings
    # Production
    RUST_LOG="edgequake=info,tower_http=info,sqlx=warn"

    # Development
    RUST_LOG="edgequake=debug,tower_http=debug"

    # Troubleshooting
    RUST_LOG="edgequake=trace,sqlx=debug"

Component-Specific Logging

    # Pipeline debugging
    RUST_LOG="edgequake_pipeline=debug"

    # Query engine debugging
    RUST_LOG="edgequake_query=debug"

    # API request tracing
    RUST_LOG="tower_http=debug"

    # Database query logging
    RUST_LOG="sqlx=debug"

Log Aggregation
Loki + Grafana
Docker Compose addition:

    services:
      loki:
        image: grafana/loki:2.9.0
        ports:
          - "3100:3100"
        volumes:
          - ./loki-config.yaml:/etc/loki/local-config.yaml

      promtail:
        image: grafana/promtail:2.9.0
        volumes:
          - /var/log:/var/log
          - ./promtail-config.yaml:/etc/promtail/config.yml
        command: -config.file=/etc/promtail/config.yml

      grafana:
        image: grafana/grafana:10.0.0
        ports:
          - "3001:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin

ELK Stack
Filebeat configuration:

    filebeat.inputs:
      - type: container
        paths:
          - "/var/lib/docker/containers/*/*.log"
        processors:
          - add_kubernetes_metadata:
              host: ${NODE_NAME}
              matchers:
                - logs_path:
                    logs_path: "/var/lib/docker/containers/"

    output.elasticsearch:
      hosts: ["elasticsearch:9200"]

Key Metrics to Monitor
Application Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| Request latency | Logs | p99 > 2s |
| Error rate | Logs | > 1% |
| Active connections | PostgreSQL | > 80% pool |
| Background task queue | Logs | > 100 pending |
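Because the request metrics above come from logs rather than a metrics endpoint, the thresholds have to be computed from parsed log records. The sketch below does that for p99 latency and error rate. It assumes each record carries `fields.duration_ms` as in the log format example, plus a hypothetical `fields.status` HTTP status code that the guide does not show. The percentile uses the nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(records, p99_limit_ms=2000, error_rate_limit=0.01):
    """Return alert strings for the request-level thresholds in the table above."""
    durations = [r["fields"]["duration_ms"] for r in records]
    # "status" is an assumed field; a missing value is treated as success.
    errors = sum(1 for r in records if r["fields"].get("status", 200) >= 500)
    alerts = []
    if percentile(durations, 99) > p99_limit_ms:
        alerts.append("p99 latency above limit")
    if errors / len(records) > error_rate_limit:
        alerts.append("error rate above limit")
    return alerts


# Synthetic data: 98 fast successes and 2 slow server errors.
records = [{"fields": {"duration_ms": 120, "status": 200}} for _ in range(98)]
records += [{"fields": {"duration_ms": 3500, "status": 503}} for _ in range(2)]
print(evaluate(records))  # ['p99 latency above limit', 'error rate above limit']
```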
PostgreSQL Metrics
| Metric | Query | Alert Threshold |
|---|---|---|
| Connection count | pg_stat_activity | > 80% max |
| Cache hit ratio | pg_stat_database | < 95% |
| Index usage | pg_stat_user_indexes | Unused indexes |
| Table bloat | pgstattuple | > 30% |
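Two of these thresholds are simple ratios over values PostgreSQL already exposes. As a rough sketch, the cache hit ratio can be derived from the `blks_hit` and `blks_read` columns of `pg_stat_database`, and connection usage from the `pg_stat_activity` row count divided by `max_connections`. The function names below are illustrative, and the inputs are assumed to have been fetched separately.

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared buffers.

    Returns 1.0 when there has been no block activity yet.
    """
    total = blks_hit + blks_read
    return 1.0 if total == 0 else blks_hit / total


def connection_usage(active: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def postgres_alerts(blks_hit, blks_read, active, max_connections):
    """Apply the 95% cache-hit and 80% connection thresholds from the table."""
    alerts = []
    if cache_hit_ratio(blks_hit, blks_read) < 0.95:
        alerts.append("cache hit ratio below 95%")
    if connection_usage(active, max_connections) > 0.80:
        alerts.append("connections above 80% of max")
    return alerts


# Synthetic numbers: 90% cache hits and 85 of 100 connections in use.
print(postgres_alerts(blks_hit=9_000, blks_read=1_000, active=85, max_connections=100))
```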
LLM Provider Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| Token usage | Provider API | Budget threshold |
| Error rate | Logs | > 5% |
| Latency | Logs | > 10s |
| Rate limits | Provider API | Near limit |
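"Budget threshold" is left open in the table, so any concrete check needs a policy. One possible policy, sketched below, warns at a fraction of the budget and flags usage at or above it. The 80% default and the `budget_status` name are assumptions chosen for illustration, not EdgeQuake settings.

```python
def budget_status(tokens_used: int, token_budget: int, warn_fraction: float = 0.8) -> str:
    """Classify token usage as 'ok', 'warning', or 'exceeded' against a budget."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if tokens_used >= token_budget:
        return "exceeded"
    if tokens_used >= warn_fraction * token_budget:
        return "warning"
    return "ok"


# Synthetic usage figures against a 1,000,000-token budget.
for used in (100_000, 850_000, 1_200_000):
    print(used, budget_status(used, 1_000_000))
```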
Alerting Rules
Prometheus Alerting (Example)

    # http_requests_total and query_duration_seconds are illustrative names;
    # the /metrics endpoint is still planned.
    groups:
      - name: edgequake
        rules:
          - alert: HighErrorRate
            # Ratio of 5xx responses to all responses, matching the 1% threshold
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: High error rate detected

          - alert: SlowQueries
            # histogram_quantile expects per-bucket rates aggregated by le
            expr: histogram_quantile(0.99, sum by (le) (rate(query_duration_seconds_bucket[5m]))) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Query latency above 2s

          - alert: DatabaseConnectionsHigh
            expr: pg_stat_activity_count > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: PostgreSQL connections high

Dashboard Examples
Key Panels for Grafana
- Request Overview
  - Requests per second
  - Error rate
  - Latency percentiles (p50, p95, p99)
- Document Processing
  - Documents indexed per minute
  - Processing time distribution
  - Queue depth
- Query Performance
  - Query latency by mode
  - Context retrieval time
  - LLM generation time
- Resource Usage
  - CPU usage
  - Memory usage
  - PostgreSQL connections
  - Disk I/O
Sample Query Panel
    # Loki query for request latency
    # The json parser flattens nested keys, so fields.duration_ms becomes fields_duration_ms
    {app="edgequake"} |= "request completed" | json | fields_duration_ms > 1000

Tracing (Future)
OpenTelemetry integration is planned:

    // Future: Distributed tracing
    #[tracing::instrument]
    async fn process_query(query: &str) -> Result<Response> {
        // Automatic span creation
        let chunks = retrieve_chunks(query).await?;
        let response = generate_response(chunks).await?;
        Ok(response)
    }

PostgreSQL Monitoring
Essential Views

    -- Active connections
    SELECT count(*) as connections, state, wait_event_type
    FROM pg_stat_activity
    WHERE datname = 'edgequake'
    GROUP BY state, wait_event_type;

    -- Long-running queries
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity
    WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
      AND state != 'idle';

    -- Table sizes
    SELECT schemaname, relname,
           pg_size_pretty(pg_total_relation_size(relid)) as total_size
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;

    -- Index usage
    SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC
    LIMIT 10;

Vector Storage Metrics

    -- Vector index stats (pgvector)
    SELECT indexname,
           pg_size_pretty(pg_relation_size(indexname::regclass)) as size
    FROM pg_indexes
    WHERE indexdef LIKE '%vector%';

    -- Chunk count per workspace
    SELECT workspace_id, count(*) as chunk_count
    FROM chunks
    GROUP BY workspace_id
    ORDER BY chunk_count DESC;

Graph Storage Metrics

    -- Entity count
    -- The Cypher query already returns one aggregated row, so select it directly
    SELECT * FROM ag_catalog.cypher('edgequake_graph', $$
      MATCH (n) RETURN count(n)
    $$) AS (count agtype);

    -- Relationship count
    SELECT * FROM ag_catalog.cypher('edgequake_graph', $$
      MATCH ()-[r]->() RETURN count(r)
    $$) AS (count agtype);

Troubleshooting
High Memory Usage
- Check background task queue
- Review connection pool size
- Analyze PostgreSQL memory settings

    # Check process memory
    ps aux | grep edgequake

    # Check PostgreSQL memory (one -c per statement so both results are printed)
    psql -c "SHOW shared_buffers" -c "SHOW work_mem"

Slow Queries
- Enable query logging

    RUST_LOG="edgequake_query=debug,sqlx=debug"

- Check PostgreSQL slow query log

    -- Enable slow query logging
    ALTER SYSTEM SET log_min_duration_statement = 1000; -- 1 second
    SELECT pg_reload_conf();

LLM Errors
- Check provider status
- Review rate limits
- Verify API keys

    # Test OpenAI connectivity
    curl https://api.openai.com/v1/models \
      -H "Authorization: Bearer $OPENAI_API_KEY"

    # Test Ollama connectivity
    curl http://localhost:11434/api/tags

Backup Monitoring
PostgreSQL Backups

    # Check last backup time (newest backup file is listed first)
    ls -lt /backups/edgequake-*.sql.gz | head -n 1

    # Verify backup size
    ls -lh /backups/edgequake-*.sql.gz

Backup Alert Rule

    - alert: BackupTooOld
      expr: time() - backup_last_success_timestamp > 86400
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: No successful backup in 24 hours

Summary
┌─────────────────────────────────────────────────────────────────┐
│                      MONITORING CHECKLIST                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ✅ Health endpoints configured for load balancer               │
│  ✅ Structured logging enabled                                  │
│  ✅ Log aggregation set up (Loki/ELK)                           │
│  ✅ Key metrics identified and dashboarded                      │
│  ✅ Alert rules defined for critical conditions                 │
│  ✅ PostgreSQL monitoring enabled                               │
│  ✅ LLM provider usage tracked                                  │
│  ✅ Backup verification automated                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

See Also
- Deployment Guide - Production deployment
- Configuration Reference - All settings
- Troubleshooting Guide - Problem solving