Deep Dive: Embedding Models

Understanding Vector Embeddings in EdgeQuake

This guide covers how embedding models work in EdgeQuake, how to choose the right one, and optimization strategies.


Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts have similar vectors.

┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING VISUALIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Text: "The cat sat on the mat" │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Embedding Model │ │
│ └────────┬────────┘ │
│ ↓ │
│ Vector: [0.23, -0.15, 0.87, 0.42, ..., -0.31] (1536 dims) │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Semantic Space (2D projection) │ │
│ │ │ │
│ │ cat● ●dog │ │
│ │ ↖ ↗ │ │
│ │ kitten● ← similar → ●puppy │ │
│ │ │ │
│ │ │ │
│ │ car● ●truck │ │
│ │ ↖ ↗ │ │
│ │ vehicle● │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Similar concepts cluster together in vector space │
│ │
└─────────────────────────────────────────────────────────────────┘
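The clustering shown above can be made concrete with cosine similarity, the metric EdgeQuake uses for vector search. A minimal sketch in Rust, using toy 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and the values here are invented for illustration):

```rust
/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (||a|| * ||b||). Result lies in [-1, 1]; 1 = identical direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy "embeddings": related concepts point in similar directions.
    let cat = [0.9, 0.1, 0.0];
    let kitten = [0.85, 0.15, 0.05];
    let car = [0.0, 0.2, 0.95];

    let near = cosine_similarity(&cat, &kitten);
    let far = cosine_similarity(&cat, &car);
    assert!(near > far); // related concepts score higher
    println!("cat~kitten = {near:.3}, cat~car = {far:.3}");
}
```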

EdgeQuake uses embeddings at multiple stages:

┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING USAGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DOCUMENT PROCESSING │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Document → Chunks → [Embed] → Store in pgvector │ │
│ │ │ │
│ │ Entities → [Embed] → Store in pgvector │ │
│ │ │ │
│ │ Relationships → [Embed] → Store in pgvector │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ 2. QUERY PROCESSING │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Query → [Embed] → Vector Search → Top-K results │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ 3. ENTITY MATCHING │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ New Entity → [Embed] → Similar Entity Search → Merge? │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

OpenAI models:

Model                    Dimensions   Max Tokens   Cost/1K     Quality
text-embedding-3-small   1536         8191         $0.00002    Good
text-embedding-3-large   3072         8191         $0.00013    Excellent
text-embedding-ada-002   1536         8191         $0.00010    Good (legacy)

Recommendation: Use text-embedding-3-small for most use cases (best cost/performance).

export EDGEQUAKE_EMBEDDING_MODEL="text-embedding-3-small"

Ollama (local) models:

Model                    Dimensions   Max Tokens   Cost   Quality
nomic-embed-text         768          8192         Free   Good
mxbai-embed-large        1024         512          Free   Very Good
all-minilm               384          256          Free   Moderate
snowflake-arctic-embed   1024         512          Free   Very Good

Recommendation: Use nomic-embed-text for local deployment (best quality/speed).

ollama pull nomic-embed-text
export EDGEQUAKE_EMBEDDING_PROVIDER="ollama"
export EDGEQUAKE_EMBEDDING_MODEL="nomic-embed-text"

┌─────────────────────────────────────────────────────────────────┐
│ DIMENSION TRADEOFFS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Lower Dimensions (384-768): │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ ✅ Faster similarity search │ │
│ │ ✅ Less storage space │ │
│ │ ✅ Lower memory usage │ │
│ │ ❌ Less semantic precision │ │
│ │ ❌ May miss subtle distinctions │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Higher Dimensions (1536-3072): │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ ✅ Better semantic precision │ │
│ │ ✅ Captures subtle nuances │ │
│ │ ✅ Better for specialized domains │ │
│ │ ❌ Slower similarity search │ │
│ │ ❌ More storage required │ │
│ │ ❌ Higher memory usage │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Storage Impact (100K embeddings): │
│ • 384 dims: 153 MB │
│ • 768 dims: 307 MB │
│ • 1536 dims: 614 MB │
│ • 3072 dims: 1.2 GB │
│ │
└─────────────────────────────────────────────────────────────────┘
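The storage figures above follow directly from 4 bytes per float32 component. A quick sketch of the arithmetic (vector payload only; pgvector adds a small per-row header on top):

```rust
/// Bytes needed to store `count` embeddings of `dims` float32 components each.
fn embedding_storage_bytes(count: u64, dims: u64) -> u64 {
    count * dims * 4 // 4 bytes per f32 component
}

fn main() {
    let mb = |bytes: u64| bytes as f64 / 1_000_000.0;
    // Reproduce the "100K embeddings" table above.
    for dims in [384u64, 768, 1536, 3072] {
        println!("{dims} dims: {:.0} MB", mb(embedding_storage_bytes(100_000, dims)));
    }
}
```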

EdgeQuake’s embedding system is built on a trait abstraction:

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    /// Get the name of this provider.
    fn name(&self) -> &str;

    /// Get the embedding model.
    fn model(&self) -> &str;

    /// Get the dimension of the embeddings.
    fn dimension(&self) -> usize;

    /// Get the maximum number of tokens per input.
    fn max_tokens(&self) -> usize;

    /// Generate embeddings for a batch of texts.
    async fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;

    /// Generate embedding for a single text.
    async fn embed_one(&self, text: &str) -> Result<Vec<f32>>;
}

Why Trait-Based Design:

  • Testing: MockProvider returns deterministic embeddings
  • Flexibility: Swap providers without code changes
  • Cost control: Route to different providers based on request
  • Resilience: Fallback providers when primary unavailable
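A deterministic mock is the simplest illustration of why the trait pays off in tests. The sketch below is synchronous and the hashing scheme is invented for illustration; it is not EdgeQuake's actual MockProvider, only the idea behind one: the same text always maps to the same unit-length vector.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministic mock embedding: hash (text, component index) into a
/// pseudo-random value in [-1, 1], then normalize to unit length.
fn mock_embed(text: &str, dims: usize) -> Vec<f32> {
    let mut v: Vec<f32> = (0..dims)
        .map(|i| {
            let mut h = DefaultHasher::new();
            (text, i).hash(&mut h);
            (h.finish() as f64 / u64::MAX as f64) as f32 * 2.0 - 1.0
        })
        .collect();
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in &mut v {
        *x /= norm;
    }
    v
}

fn main() {
    let a = mock_embed("hello", 8);
    let b = mock_embed("hello", 8);
    assert_eq!(a, b); // same input, same vector: deterministic tests
}
```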

EdgeQuake uses cosine similarity by default with pgvector:

-- Cosine distance operator (<=>): smaller distance = more similar,
-- so ordering ascending returns the closest vectors first
SELECT * FROM embeddings
ORDER BY embedding <=> query_embedding
LIMIT 10;

-- Also available:
-- Inner product: <#>
-- L2 distance: <->

Why Cosine Similarity:

  • Normalized vectors (magnitude doesn’t matter)
  • Works well for text embeddings
  • Range: -1 to 1 (1 = identical)
  • pgvector optimized for this metric

┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: "Artificial intelligence is transforming..." │
│ ↓ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 1. TOKENIZATION │ │
│ │ Split text into tokens: ["Artificial", "intelligence", │ │
│ │ "is", "transforming", ...] │ │
│ │ Check: tokens < max_tokens (8191 for OpenAI) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 2. BATCHING │ │
│ │ Group texts for efficient API calls │ │
│ │ OpenAI: up to 2048 texts per batch │ │
│ │ Ollama: 1 text per call (no batching) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 3. API CALL │ │
│ │ POST /v1/embeddings │ │
│ │ {"input": texts, "model": "text-embedding-3-small"} │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 4. NORMALIZATION │ │
│ │ Ensure unit length: ||v|| = 1 │ │
│ │ (OpenAI returns pre-normalized, Ollama may not) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 5. STORAGE │ │
│ │ INSERT INTO embeddings (id, embedding, ...) │ │
│ │ VALUES ($1, $2::vector, ...) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
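Step 4 of the pipeline is worth spelling out: with unit-length vectors, cosine similarity reduces to a plain dot product, which is why normalization matters when a provider (as the diagram notes for Ollama) may not pre-normalize. A minimal sketch:

```rust
/// Normalize a vector to unit length (||v|| = 1).
/// Vectors that are already normalized pass through unchanged.
fn normalize(v: &mut [f32]) {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut v = vec![3.0f32, 4.0]; // length 5 before normalization
    normalize(&mut v);
    let len: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((len - 1.0).abs() < 1e-6);
    println!("{v:?}"); // [0.6, 0.8]
}
```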

Requirement       Recommended Model
Lowest cost       nomic-embed-text (Ollama, free)
Best quality      text-embedding-3-large (OpenAI)
Best value        text-embedding-3-small (OpenAI)
Privacy (local)   nomic-embed-text (Ollama)
Low latency       all-minilm (Ollama, 384 dims)
High recall       text-embedding-3-large (3072 dims)

Domain              Recommendation
General knowledge   text-embedding-3-small
Legal/medical       text-embedding-3-large (precision matters)
Multi-language      text-embedding-3-small (good multilingual)
Code/technical      text-embedding-3-small + domain chunks

# Create workspace with specific embedding model
curl -X POST http://localhost:8080/api/v1/tenants/default/workspaces \
  -H "Content-Type: application/json" \
  -d '{
    "name": "research",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimension": 3072
  }'
# Environment variables
export EDGEQUAKE_EMBEDDING_PROVIDER="openai"
export EDGEQUAKE_EMBEDDING_MODEL="text-embedding-3-small"
export EDGEQUAKE_EMBEDDING_DIMENSION="1536"

Or in the configuration file:

[defaults]
embedding_provider = "openai"
embedding_model = "text-embedding-3-small"

[providers.openai.embedding_models.text-embedding-3-small]
display_name = "Text Embedding 3 Small"
dimensions = 1536
max_tokens = 8191
price_per_1k_input_tokens = 0.00002

[providers.openai.embedding_models.text-embedding-3-large]
display_name = "Text Embedding 3 Large"
dimensions = 3072
max_tokens = 8191
price_per_1k_input_tokens = 0.00013

Warning: Changing embedding models requires rebuilding all embeddings.

# 1. Update workspace settings
curl -X PUT http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID \
  -d '{"embedding_model": "text-embedding-3-large", "embedding_dimension": 3072}'

# 2. Rebuild embeddings (this reprocesses all documents)
curl -X POST http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID/rebuild-embeddings

# 3. Monitor progress
curl http://localhost:8080/api/v1/tasks?status=running

Why Rebuild Is Required:

  • Different models produce different vector spaces
  • Vectors from different models are not comparable
  • Search would return incorrect results with mixed embeddings

┌─────────────────────────────────────────────────────────────────┐
│ BATCH VS SEQUENTIAL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sequential (slow): │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Text 1 → API → Wait → Text 2 → API → Wait → ... │ │
│ │ │ │
│ │ 100 texts × 100ms = 10 seconds │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Batched (fast): │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ [Text 1, Text 2, ..., Text 100] → API → All embeddings │ │
│ │ │ │
│ │ 1 API call = 150ms │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Speedup: 67x faster with batching │
│ │
└─────────────────────────────────────────────────────────────────┘
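The batching step itself is simple: group texts into provider-sized chunks before calling the API (the diagram cites up to 2048 inputs per OpenAI call). A sketch of the grouping logic:

```rust
/// Split texts into batches of at most `batch_size` for the embedding API.
/// The final batch holds whatever remains.
fn into_batches(texts: &[String], batch_size: usize) -> Vec<&[String]> {
    texts.chunks(batch_size).collect()
}

fn main() {
    let texts: Vec<String> = (0..5000).map(|i| format!("chunk {i}")).collect();
    let batches = into_batches(&texts, 2048);
    // 5000 texts => 2048 + 2048 + 904: three API calls instead of 5000
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 904);
}
```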

EdgeQuake caches query embeddings in memory:

// Pseudo-code: query embedding caching
let cache_key = hash(query_text);
if let Some(embedding) = cache.get(cache_key) {
    return embedding; // Cache hit: 0ms
}
let embedding = provider.embed_one(query).await?; // 50ms
cache.insert(cache_key, embedding);
return embedding;
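The pseudo-code above can be fleshed out into a minimal compiling sketch. This is illustrative only (a production cache would also bound its size and evict old entries):

```rust
use std::collections::HashMap;

/// Minimal in-memory query-embedding cache keyed by query text.
struct EmbeddingCache {
    map: HashMap<String, Vec<f32>>,
    misses: usize,
}

impl EmbeddingCache {
    fn new() -> Self {
        Self { map: HashMap::new(), misses: 0 }
    }

    /// Return the cached embedding, or compute and store it on a miss.
    fn get_or_embed(&mut self, query: &str, embed: impl Fn(&str) -> Vec<f32>) -> Vec<f32> {
        if let Some(v) = self.map.get(query) {
            return v.clone(); // cache hit: no provider call
        }
        self.misses += 1; // cache miss: pay for one embedding call
        let v = embed(query);
        self.map.insert(query.to_string(), v.clone());
        v
    }
}

fn main() {
    let mut cache = EmbeddingCache::new();
    let fake_provider = |_: &str| vec![0.6, 0.8]; // stand-in for embed_one
    cache.get_or_embed("what is rust?", fake_provider);
    cache.get_or_embed("what is rust?", fake_provider);
    assert_eq!(cache.misses, 1); // second call was served from the cache
}
```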
For large collections, create an HNSW index and tune the recall/latency tradeoff:

-- Create optimized HNSW index
CREATE INDEX CONCURRENTLY embeddings_hnsw_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Tune search quality/speed tradeoff
SET hnsw.ef_search = 100; -- Higher = better recall, slower

ef_search   Recall   Latency
40          95%      10ms
100         98%      20ms
200         99%      40ms

Model                    Cost/1M tokens   100K Docs (500 tokens each)
text-embedding-3-small   $0.02            $1.00
text-embedding-3-large   $0.13            $6.50
text-embedding-ada-002   $0.10            $5.00

Model               GPU VRAM   Tokens/sec
nomic-embed-text    1.5 GB     500
mxbai-embed-large   2 GB       300
all-minilm          0.5 GB     1000
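The API cost figures above are just tokens times price. A quick sketch of the arithmetic, using the per-1K prices listed earlier in this guide:

```rust
/// Estimated embedding cost: (docs * tokens per doc / 1000) * price per 1K tokens.
fn embedding_cost(docs: u64, tokens_per_doc: u64, price_per_1k: f64) -> f64 {
    (docs * tokens_per_doc) as f64 / 1000.0 * price_per_1k
}

fn main() {
    // 100K docs at ~500 tokens each with text-embedding-3-small ($0.00002/1K)
    let small = embedding_cost(100_000, 500, 0.00002);
    // Same corpus with text-embedding-3-large ($0.00013/1K)
    let large = embedding_cost(100_000, 500, 0.00013);
    println!("small: ${small:.2}, large: ${large:.2}");
}
```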

Error: Vector dimension 768 does not match index dimension 1536

Cause: Embedding model changed without rebuilding index.

Solution:

curl -X POST http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID/rebuild-embeddings
Error: CUDA out of memory

Solution: Use smaller model or reduce batch size:

ollama pull all-minilm # Smaller model
Error: Rate limit exceeded

Solution: EdgeQuake automatically retries with backoff. For high throughput:

  • Use Tier 2+ OpenAI account
  • Or use local Ollama for embedding

  1. Consistency: Use same embedding model for entire workspace
  2. Match Dimensions: Ensure workspace dimension matches model output
  3. Batch When Possible: Reduce API calls by batching texts
  4. Monitor Costs: Track embedding token usage in cost dashboard
  5. Consider Local: Use Ollama for sensitive data or high volume
  6. Test Before Switching: Compare quality before changing models
  7. Index Optimization: Tune HNSW parameters for your workload