# Performance Tuning Guide
Optimizing EdgeQuake for Production Workloads
This guide covers performance tuning strategies for EdgeQuake deployments.
## Performance Overview
A typical hybrid query breaks down as follows:

| Phase | Time | Bottleneck |
|---|---|---|
| Embedding | 50ms | LLM API latency |
| Vector search | 20ms | pgvector index |
| Graph traverse | 30ms | Apache AGE queries |
| LLM generation | 2000ms | Token generation (dominant) |
| Network/parse | 50ms | Serialization |
| **Total** | ~2150ms | LLM is ~93% of latency |

Key insight: optimizing LLM selection has the largest impact.
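As a sanity check, the phase timings can be summed to confirm how dominant generation is; a quick sketch using the representative numbers from the breakdown:

```python
# Representative per-phase latencies (ms) from the breakdown above
phases = {
    "embedding": 50,
    "vector_search": 20,
    "graph_traverse": 30,
    "llm_generation": 2000,
    "network_parse": 50,
}

total_ms = sum(phases.values())
llm_share = phases["llm_generation"] / total_ms
print(f"total: {total_ms}ms, LLM share: {llm_share:.0%}")
# → total: 2150ms, LLM share: 93%
```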
## Quick Wins

### 1. Choose Faster LLM Models
| Model | Latency (TTFT) | Cost | Quality |
|---|---|---|---|
| gpt-4o | 500ms | $$$$ | Excellent |
| gpt-4o-mini | 200ms | $ | Very Good |
| gemma3:12b (Ollama) | 100ms | Free | Good |
| llama3.2:3b (Ollama) | 50ms | Free | Moderate |
Recommendation: use `gpt-4o-mini` for production (best latency/quality ratio).
```sh
export EDGEQUAKE_LLM_MODEL=gpt-4o-mini
```
### 2. Reduce Context Size

Smaller context means faster LLM processing:
```sh
# Query with fewer chunks
curl -X POST http://localhost:8080/api/v1/query \
  -d '{"query": "...", "max_chunks": 5, "max_entities": 5}'
```

Default vs optimized:
| Setting | Default | Optimized |
|---|---|---|
| `max_chunks` | 20 | 5-10 |
| `max_entities` | 10 | 3-5 |
| `max_relationships` | 20 | 5-10 |
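Putting the optimized limits together, a client might build the request body like this (a sketch; the field names come from the table above and the endpoint is the `/api/v1/query` route shown earlier):

```python
import json

# Query payload with the tightened retrieval limits from the table above
payload = {
    "query": "What is X?",
    "max_chunks": 10,
    "max_entities": 5,
    "max_relationships": 10,
}

body = json.dumps(payload)
# send with: curl -X POST http://localhost:8080/api/v1/query -d "$body"
```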
### 3. Use Appropriate Query Mode
| Mode | Speed | Use Case |
|---|---|---|
| `naive` | Fastest | Simple factual queries |
| `local` | Fast | Entity-focused queries |
| `hybrid` | Medium | General queries |
| `global` | Slow | Overview/theme queries |
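One way to apply this table automatically is a client-side heuristic that picks a mode from the query's shape. `choose_mode` below is hypothetical (EdgeQuake does not ship such a helper); it only illustrates the trade-offs in the table:

```python
# Hypothetical mode picker: cheap modes for simple questions,
# expensive modes only when the query actually needs them.
def choose_mode(query: str) -> str:
    words = query.lower().split()
    if words and words[0] in ("what", "who", "when") and len(words) <= 6:
        return "naive"      # fastest: simple factual lookup
    if any(w in words for w in ("overview", "themes", "summarize")):
        return "global"     # slowest: corpus-wide questions
    return "hybrid"         # balanced default
```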
```sh
# Fast mode for simple queries
curl -X POST http://localhost:8080/api/v1/query \
  -d '{"query": "What is X?", "mode": "naive"}'
```
## Document Processing Optimization

### Worker Configuration
```sh
# Default: uses all CPU cores.
# For I/O-bound workloads (LLM API calls), use 2x cores.
export WORKER_THREADS=8  # for a 4-core machine
```
### Chunk Size Tuning

Small chunks (256 tokens):

- ✅ More precise retrieval
- ✅ Lower token cost per extraction
- ❌ More LLM calls (slower processing)
- ❌ Less context per chunk

Large chunks (1024 tokens):

- ✅ Fewer LLM calls (faster processing)
- ✅ Better context preservation
- ❌ Less precise retrieval
- ❌ Higher token cost per extraction

Recommendation: 1200 tokens (the default, balanced).
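The processing-speed side of this trade-off is easy to estimate: the number of extraction calls is roughly the document length divided by the chunk size. A back-of-envelope sketch (ignoring chunk overlap, which the real chunker may add):

```python
import math

def extraction_calls(doc_tokens: int, chunk_size: int) -> int:
    # One entity-extraction LLM call per chunk, no overlap assumed
    return math.ceil(doc_tokens / chunk_size)

doc_tokens = 120_000  # e.g. a large PDF
for size in (256, 1024, 1200):
    print(size, extraction_calls(doc_tokens, size))
# 256 → 469 calls, 1024 → 118, 1200 → 100
```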
### Batch Processing

For bulk uploads, process documents in batches:
```sh
# Upload via the batch endpoint (more efficient)
curl -X POST http://localhost:8080/api/v1/documents/upload/batch \
  -F "files=@doc1.pdf" \
  -F "files=@doc2.pdf" \
  -F "files=@doc3.pdf"
```
## Database Optimization

### PostgreSQL Configuration
`postgresql.conf` tuning for EdgeQuake:
```ini
# Memory (adjust for your RAM)
shared_buffers = 4GB              # 25% of RAM
effective_cache_size = 12GB       # 75% of RAM
work_mem = 256MB                  # for complex queries
maintenance_work_mem = 1GB        # for indexing

# Connections
max_connections = 200             # match app pool size

# Write-ahead log
wal_buffers = 64MB
checkpoint_completion_target = 0.9

# Query planning
random_page_cost = 1.1            # for SSD storage
effective_io_concurrency = 200    # for SSD storage

# Parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
```
### Connection Pooling

Use PgBouncer for high-concurrency deployments:
```ini
[databases]
edgequake = host=localhost port=5432 dbname=edgequake

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
reserve_pool_size = 10
```

Connection string:
```sh
# Via PgBouncer (port 6432)
DATABASE_URL="postgresql://user:pass@localhost:6432/edgequake"
```
### pgvector Index Tuning

```sql
-- Check the current index
\d embeddings
```
```sql
-- HNSW parameters tuned for performance
CREATE INDEX CONCURRENTLY embeddings_vector_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- For higher recall (slower):
-- WITH (m = 32, ef_construction = 128);
```

Search quality vs speed:
| ef_search | Recall | Latency |
|---|---|---|
| 40 | 95% | 10ms |
| 100 | 98% | 20ms |
| 200 | 99% | 40ms |
```sql
-- Set search quality at runtime
SET hnsw.ef_search = 100;
```
### Apache AGE Tuning

```sql
-- Load the AGE extension and put its catalog on the search path
SET search_path = ag_catalog, "$user", public;
LOAD 'age';
```
```sql
-- Create dedicated vertex and edge labels for commonly queried types
SELECT create_vlabel('edgequake_graph', 'Entity');
SELECT create_elabel('edgequake_graph', 'Relationship');
```
## Query Optimization

### Embedding Caching
EdgeQuake caches embeddings for repeated queries:
Query "What is X?" → [embedding cache] → vector search

- Cache hit: ~0ms
- Cache miss: ~50ms (embedding API call)

The cache is in-memory and is cleared on restart.
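In outline, such a cache behaves like the sketch below: a plain in-process dict keyed by the query text, lost on restart. `embed_fn` stands in for the real embedding call; this is an illustration, not EdgeQuake's actual implementation.

```python
class EmbeddingCache:
    """In-memory embedding cache: hit ≈ 0ms, miss pays the embedding call."""

    def __init__(self, embed_fn):
        self._embed = embed_fn   # the (slow) embedding API call
        self._store = {}         # cleared whenever the process restarts
        self.hits = 0

    def get(self, query: str):
        if query in self._store:
            self.hits += 1            # cache hit: no API round-trip
            return self._store[query]
        vec = self._embed(query)      # cache miss: ~50ms API round-trip
        self._store[query] = vec
        return vec
```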
### Reranking Strategy

Reranking improves quality but adds latency:
```sh
# Disable reranking for faster queries
curl -X POST http://localhost:8080/api/v1/query \
  -d '{"query": "...", "enable_rerank": false}'
```
```sh
# Or use a smaller rerank set
curl -X POST http://localhost:8080/api/v1/query \
  -d '{"query": "...", "rerank_top_k": 3}'
```

| Reranking | Latency | Quality |
|---|---|---|
| Disabled | Baseline | Baseline |
| Top 3 | +30ms | +5% |
| Top 5 | +50ms | +8% |
| Top 10 | +100ms | +10% |
### Query Prefetching
For chat applications, prefetch likely follow-up queries:
```js
// Client-side optimization (extractEntities is an application-defined helper)
async function queryWithPrefetch(query) {
  const response = await fetch("/api/v1/query", {
    method: "POST",
    body: JSON.stringify({ query }),
  });
  const result = await response.json();

  // Prefetch entity expansions in the background
  const entities = extractEntities(result);
  entities.slice(0, 3).forEach((entity) => {
    fetch(`/api/v1/graph/entities/${entity}/neighborhood`);
  });
  return result;
}
```
## LLM Provider Optimization

### OpenAI Optimization
```sh
# Use streaming for faster time-to-first-token
curl -X POST http://localhost:8080/api/v1/query/stream \
  -H "Accept: text/event-stream" \
  -d '{"query": "..."}'
```
### Ollama Optimization

GPU acceleration:
```sh
# Ensure CUDA is available
nvidia-smi
```
```sh
# Set GPU layers (more = faster, but more VRAM)
export OLLAMA_NUM_GPU=50
ollama serve
```

Model quantization:
| Quantization | Speed | Quality | VRAM |
|---|---|---|---|
| Q4_K_M | Fastest | Good | 4GB |
| Q5_K_M | Fast | Better | 5GB |
| Q8_0 | Slow | Best | 8GB |
| FP16 | Slowest | Reference | 16GB |
```sh
# Download a quantized model
ollama pull gemma3:12b-q4_K_M
```
### Local vs Cloud Latency

Local Ollama (RTX 4090):

- Time to first token: 50ms
- Token generation: 100 tokens/sec
- Total (500 tokens): 5.05s

OpenAI gpt-4o-mini:

- Time to first token: 200ms
- Token generation: 80 tokens/sec
- Network overhead: 50ms
- Total (500 tokens): 6.5s

Verdict: a local GPU is faster for inference-heavy workloads.
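The totals above come from a simple model: time-to-first-token plus token count divided by generation rate, plus any network overhead. A sketch reproducing both figures:

```python
def total_seconds(ttft_ms: float, tokens: int, tok_per_sec: float,
                  network_ms: float = 0.0) -> float:
    # TTFT + generation time + network overhead
    return (ttft_ms + network_ms) / 1000 + tokens / tok_per_sec

print(total_seconds(50, 500, 100))      # local Ollama  ≈ 5.05s
print(total_seconds(200, 500, 80, 50))  # gpt-4o-mini   ≈ 6.5s
```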
## Scaling Strategies

### Horizontal Scaling
```
                 Load Balancer
                       │
     ┌─────────────────┼─────────────────┐
     ↓                 ↓                 ↓
EdgeQuake 1       EdgeQuake 2       EdgeQuake 3
 (Queries)         (Queries)        (Processing)
     │                 │                 │
     └─────────────────┼─────────────────┘
                       ↓
                  PostgreSQL
                  + Replicas
```

Kubernetes HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: edgequake-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: edgequake
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
### Read Replicas

Separate read and write workloads:
```sh
# Primary for writes
DATABASE_URL="postgresql://user:pass@primary:5432/edgequake"

# Replica for reads (queries)
DATABASE_READ_URL="postgresql://user:pass@replica:5432/edgequake"
```
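Application code then needs a small routing rule: reads go to the replica when one is configured, writes always go to the primary. A hypothetical client-side sketch (the variable names match the snippet above; EdgeQuake's own routing may differ):

```python
import os

# Example values; in production these come from the environment
os.environ.setdefault("DATABASE_URL", "postgresql://user:pass@primary:5432/edgequake")
os.environ.setdefault("DATABASE_READ_URL", "postgresql://user:pass@replica:5432/edgequake")

def db_url(readonly: bool) -> str:
    primary = os.environ["DATABASE_URL"]
    if readonly:
        # Fall back to the primary when no replica is configured
        return os.environ.get("DATABASE_READ_URL", primary)
    return primary
```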
## Monitoring Performance

### Key Metrics
| Metric | Target | Alert |
|---|---|---|
| p50 query latency | <2s | >5s |
| p99 query latency | <10s | >30s |
| Processing throughput | >1 doc/min | <0.5 doc/min |
| Error rate | <1% | >5% |
| DB connection pool | <80% | >90% |
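These thresholds are easy to turn into an automated check. The sketch below is illustrative (the alert limits come from the table; the metric names and the checker itself are assumptions, not an EdgeQuake API):

```python
# Alert thresholds from the table above
ALERTS = {
    "p50_query_latency_s": 5,
    "p99_query_latency_s": 30,
    "error_rate": 0.05,
    "db_pool_utilization": 0.90,
}

def breached(metrics: dict) -> list:
    # Names of metrics currently over their alert threshold
    return [name for name, limit in ALERTS.items()
            if metrics.get(name, 0) > limit]

print(breached({"p50_query_latency_s": 6.2, "error_rate": 0.01}))
# → ['p50_query_latency_s']
```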
### Prometheus Queries
```promql
# Query latency percentiles
histogram_quantile(0.99, rate(edgequake_query_duration_seconds_bucket[5m]))

# Processing throughput
rate(edgequake_documents_processed_total[5m])

# Error rate
rate(edgequake_query_errors_total[5m]) / rate(edgequake_query_total[5m])
```
### Benchmarking

```sh
# Run built-in benchmarks
cargo bench
```
```
# Results:
# vector_search      10.2 ms/iter
# graph_traverse      5.1 ms/iter
# entity_extraction  150 ms/iter (mock LLM)
```
## Performance Checklist

### Pre-Optimization
- Baseline metrics recorded
- Bottleneck identified (usually LLM)
- Resource monitoring in place
### Quick Wins
- Using gpt-4o-mini (or a faster model)
- Context size reduced (max_chunks ≤ 10)
- Appropriate query mode selected
- Streaming enabled for chat
### Database
- PostgreSQL tuned for RAM
- pgvector HNSW index created
- Connection pooling enabled
- Read replicas for high load
### Scaling
- Horizontal scaling configured
- Auto-scaling rules defined
- Load testing completed
- Graceful degradation planned
## Troubleshooting Slow Queries
### Debug Query Timing
```sh
# Add timing information to the response
curl -X POST http://localhost:8080/api/v1/query \
  -d '{"query": "...", "debug": true}'
```

Response:
```json
{
  "answer": "...",
  "stats": {
    "embedding_time_ms": 45,
    "retrieval_time_ms": 123,
    "generation_time_ms": 2890,
    "total_time_ms": 3058
  }
}
```
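The `stats` block makes it straightforward to find the dominant phase programmatically, which tells you which fix to reach for. A small sketch parsing a response like the one above:

```python
import json

response = '''{
  "answer": "...",
  "stats": {
    "embedding_time_ms": 45,
    "retrieval_time_ms": 123,
    "generation_time_ms": 2890,
    "total_time_ms": 3058
  }
}'''

stats = json.loads(response)["stats"]
phases = {k: v for k, v in stats.items() if k != "total_time_ms"}
slowest = max(phases, key=phases.get)
print(slowest)  # → generation_time_ms: reduce context or switch model
```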
### Common Causes

| Symptom | Cause | Fix |
|---|---|---|
| Slow embedding | Cold start | Warm up with test query |
| Slow retrieval | Missing index | Create HNSW index |
| Slow generation | Large context | Reduce max_chunks |
| Slow generation | Slow model | Switch to faster model |
| High latency variance | Connection pool | Enable PgBouncer |
## See Also
- Configuration Reference - All settings
- Deployment Guide - Production setup
- Monitoring Guide - Observability