
Monitoring Guide

Observability for EdgeQuake Deployments

This guide covers monitoring, logging, and alerting for EdgeQuake in production environments.


OBSERVABILITY OVERVIEW

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  EdgeQuake  │───▶│    Logs     │───▶│  Log Aggr.  │
│   Server    │    │  (stdout)   │    │ (Loki/ELK)  │
└──────┬──────┘    └─────────────┘    └─────────────┘
       │
       ├─────────▶ /health endpoints
       │
       ├─────────▶ /metrics (planned)
       │
       └─────────▶ PostgreSQL metrics
```

EdgeQuake provides built-in health endpoints:

| Endpoint | Purpose | Response |
| --- | --- | --- |
| GET /health | Basic liveness | { "status": "ok" } |
| GET /health/ready | Readiness check | Database + LLM status |
| GET /health/live | Kubernetes liveness | Process check |
```bash
curl http://localhost:8080/health
```

```json
{
  "status": "ok",
  "version": "0.1.0",
  "storage_mode": "postgresql"
}
```

```bash
curl http://localhost:8080/health/ready
```

```json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "llm_provider": "ok"
  }
}
```
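A load balancer or orchestrator usually reduces the readiness payload to a single boolean: ready only when the top-level status is "ready" and every dependency check reports "ok". A minimal sketch of that decision in Python (the JSON shape follows the example above; the helper name is ours):

```python
import json

def is_ready(body: str) -> bool:
    """Return True only if status is 'ready' and every check reports 'ok'."""
    payload = json.loads(body)
    if payload.get("status") != "ready":
        return False
    checks = payload.get("checks", {})
    return bool(checks) and all(v == "ok" for v in checks.values())

ready_body = '{"status": "ready", "checks": {"database": "ok", "llm_provider": "ok"}}'
degraded_body = '{"status": "ready", "checks": {"database": "ok", "llm_provider": "error"}}'

print(is_ready(ready_body))     # True
print(is_ready(degraded_body))  # False
```

Treating any failed dependency as "not ready" lets the load balancer drain traffic away before requests start erroring.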

EdgeQuake uses structured JSON logging via the tracing crate:

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "INFO",
  "target": "edgequake_api::handlers::documents",
  "message": "Document uploaded successfully",
  "fields": {
    "document_id": "doc_123",
    "workspace_id": "ws_456",
    "duration_ms": 1234
  }
}
```
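Because each line is a self-contained JSON object, ad-hoc analysis needs no special tooling; for example, pulling slow requests out of a captured log stream (field names follow the example above, the threshold is illustrative):

```python
import json

def slow_requests(lines, threshold_ms=1000):
    """Yield parsed log records whose fields.duration_ms exceeds threshold_ms."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. a panic printed to stdout)
        if record.get("fields", {}).get("duration_ms", 0) > threshold_ms:
            yield record

log = [
    '{"level": "INFO", "message": "Document uploaded successfully", "fields": {"duration_ms": 1234}}',
    '{"level": "INFO", "message": "Query completed", "fields": {"duration_ms": 87}}',
    "plain text line",
]
print([r["message"] for r in slow_requests(log)])  # ['Document uploaded successfully']
```

The same one-field-at-a-time filtering is what a log aggregator's JSON parser does at query time, just at much larger scale.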
| Level | RUST_LOG Setting | Use Case |
| --- | --- | --- |
| Error | error | Critical failures |
| Warn | warn | Degraded but working |
| Info | info | Production operations |
| Debug | debug | Development debugging |
| Trace | trace | Detailed tracing |
```bash
# Production
RUST_LOG="edgequake=info,tower_http=info,sqlx=warn"

# Development
RUST_LOG="edgequake=debug,tower_http=debug"

# Troubleshooting
RUST_LOG="edgequake=trace,sqlx=debug"
```

```bash
# Pipeline debugging
RUST_LOG="edgequake_pipeline=debug"

# Query engine debugging
RUST_LOG="edgequake_query=debug"

# API request tracing
RUST_LOG="tower_http=debug"

# Database query logging
RUST_LOG="sqlx=debug"
```

Docker Compose addition:

```yaml
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:10.0.0
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
```

Filebeat configuration:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - "/var/lib/docker/containers/*/*.log"
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```

Application metrics

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Request latency | Logs | p99 > 2s |
| Error rate | Logs | > 1% |
| Active connections | PostgreSQL | > 80% of pool |
| Background task queue | Logs | > 100 pending |

PostgreSQL metrics

| Metric | Query | Alert Threshold |
| --- | --- | --- |
| Connection count | pg_stat_activity | > 80% of max |
| Cache hit ratio | pg_stat_database | < 95% |
| Index usage | pg_stat_user_indexes | Unused indexes |
| Table bloat | pgstattuple | > 30% |

LLM provider metrics

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Token usage | Provider API | Budget threshold |
| Error rate | Logs | > 5% |
| Latency | Logs | > 10s |
| Rate limits | Provider API | Near limit |
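The ratio-based thresholds above reduce to simple arithmetic once the raw counters are in hand; a sketch of the cache-hit-ratio and error-rate checks (the counter values here are illustrative, not from a real deployment):

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """PostgreSQL buffer-cache hit ratio; pg_stat_database exposes both counters."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests over a sampling window."""
    return errors / requests if requests else 0.0

ratio = cache_hit_ratio(blks_hit=990_000, blks_read=10_000)
print(f"{ratio:.2%}", ratio < 0.95)  # 99.00% False -> no alert

rate = error_rate(errors=30, requests=1_000)
print(f"{rate:.1%}", rate > 0.01)    # 3.0% True -> alert (above the 1% threshold)
```

Note that both checks only make sense over a window: pg_stat_database counters are cumulative since the last stats reset, so production checks should diff two snapshots rather than use lifetime totals.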

```yaml
groups:
  - name: edgequake
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected

      - alert: SlowQueries
        expr: histogram_quantile(0.99, query_duration_seconds_bucket) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Query latency above 2s

      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: PostgreSQL connections high
```
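The HighErrorRate rule relies on Prometheus's rate(), which estimates the per-second increase of a counter over the window. A rough sketch of that calculation over two samples (real rate() additionally handles counter resets and window extrapolation, which this ignores):

```python
def simple_rate(samples):
    """Per-second increase of a monotonically increasing counter.

    samples: list of (unix_timestamp, counter_value) pairs, oldest first.
    Unlike Prometheus rate(), this ignores counter resets and does no
    extrapolation to the window boundaries.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# 12 5xx responses over a 5-minute (300 s) window -> 0.04/s,
# which is above the 0.01/s threshold in the rule above.
samples = [(0, 100), (300, 112)]
r = simple_rate(samples)
print(r, r > 0.01)  # 0.04 True
```

This is also why the rule has `for: 5m`: the alert fires only after the rate stays above the threshold for a sustained period, filtering out single-scrape spikes.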

  1. Request Overview

    • Requests per second
    • Error rate
    • Latency percentiles (p50, p95, p99)
  2. Document Processing

    • Documents indexed per minute
    • Processing time distribution
    • Queue depth
  3. Query Performance

    • Query latency by mode
    • Context retrieval time
    • LLM generation time
  4. Resource Usage

    • CPU usage
    • Memory usage
    • PostgreSQL connections
    • Disk I/O
```
# Loki query for request latency
{app="edgequake"} |= "request completed" | json | duration_ms > 1000
```
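For the latency panels, percentiles can be computed directly from a sample of request durations; a minimal sketch using the nearest-rank method (a dashboard backend would normally derive these from histogram buckets instead of raw samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of the sorted samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Illustrative request durations in milliseconds, including one outlier:
durations_ms = [12, 15, 18, 22, 30, 45, 60, 95, 140, 2200]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(durations_ms, p)} ms")
# p50 = 30 ms
# p95 = 2200 ms
# p99 = 2200 ms
```

With only ten samples, p95 and p99 both land on the single outlier, which is exactly why high percentiles need large sample windows (or histograms) to be stable.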

OpenTelemetry integration is planned:

```rust
// Future: Distributed tracing
#[tracing::instrument]
async fn process_query(query: &str) -> Result<Response> {
    // Automatic span creation
    let chunks = retrieve_chunks(query).await?;
    let response = generate_response(chunks).await?;
    Ok(response)
}
```

```sql
-- Active connections
SELECT count(*) AS connections,
       state,
       wait_event_type
FROM pg_stat_activity
WHERE datname = 'edgequake'
GROUP BY state, wait_event_type;

-- Long-running queries
SELECT pid,
       now() - pg_stat_activity.query_start AS duration,
       query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
  AND state != 'idle';

-- Table sizes
SELECT schemaname,
       relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- Index usage
SELECT schemaname,
       relname,
       indexrelname,
       idx_scan,
       idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 10;
```

```sql
-- Vector index stats (pgvector)
SELECT indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE indexdef LIKE '%vector%';

-- Chunk count per workspace
SELECT workspace_id,
       count(*) AS chunk_count
FROM chunks
GROUP BY workspace_id
ORDER BY chunk_count DESC;

-- Entity count
SELECT count(*) FROM ag_catalog.cypher('edgequake_graph', $$
  MATCH (n) RETURN count(n)
$$) AS (count agtype);

-- Relationship count
SELECT count(*) FROM ag_catalog.cypher('edgequake_graph', $$
  MATCH ()-[r]->() RETURN count(r)
$$) AS (count agtype);
```

High memory usage

  1. Check the background task queue
  2. Review the connection pool size
  3. Analyze PostgreSQL memory settings

```bash
# Check process memory
ps aux | grep edgequake

# Check PostgreSQL memory settings
psql -c "SHOW shared_buffers; SHOW work_mem;"
```

Slow queries

  1. Enable query logging:

```bash
RUST_LOG="edgequake_query=debug,sqlx=debug"
```

  2. Check the PostgreSQL slow query log:

```sql
-- Enable slow query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- 1 second
SELECT pg_reload_conf();
```
LLM provider errors

  1. Check provider status
  2. Review rate limits
  3. Verify API keys

```bash
# Test OpenAI connectivity
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

# Test Ollama connectivity
curl http://localhost:11434/api/tags
```

Backup monitoring

```bash
# Check last backup time (newest dump first)
ls -lt /backups/edgequake-*.sql.gz | head -n 1

# Verify backup size
ls -lh /backups/edgequake-*.sql.gz
```

```yaml
- alert: BackupTooOld
  expr: time() - backup_last_success_timestamp > 86400
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: No successful backup in 24 hours
```
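The same 24-hour freshness check can also run outside Prometheus, e.g. from cron; a sketch that flags a backup directory whose newest dump is older than a day (the path pattern and the 86400 s threshold mirror the alert above; the function name is ours):

```python
import glob
import os
import time

def backup_is_fresh(pattern: str, max_age_seconds: int = 86_400) -> bool:
    """True if the newest file matching pattern was modified within max_age_seconds."""
    files = glob.glob(pattern)
    if not files:
        return False  # no backups at all counts as stale
    newest = max(os.path.getmtime(f) for f in files)
    return (time.time() - newest) <= max_age_seconds

# Example check against the backup path used above:
print(backup_is_fresh("/backups/edgequake-*.sql.gz"))
```

Exit with a non-zero status (or push to an alerting webhook) when this returns False to get paged the same way the Prometheus rule would page you.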

MONITORING CHECKLIST

- ✅ Health endpoints configured for load balancer
- ✅ Structured logging enabled
- ✅ Log aggregation set up (Loki/ELK)
- ✅ Key metrics identified and dashboarded
- ✅ Alert rules defined for critical conditions
- ✅ PostgreSQL monitoring enabled
- ✅ LLM provider usage tracked
- ✅ Backup verification automated