Deep Dive: Gleaning
Deep Dive: Gleaning
Section titled “Deep Dive: Gleaning”Multi-Pass Extraction for Comprehensive Entity Discovery
Gleaning is EdgeQuake’s iterative re-extraction strategy that improves entity recall by prompting the LLM to find entities it missed in previous passes.
Overview
Section titled “Overview”Single-pass LLM extraction typically captures 65-80% of entities in a document. Gleaning increases this to 90%+ through iterative refinement:
┌─────────────────────────────────────────────────────────────────┐│ GLEANING CONCEPT │├─────────────────────────────────────────────────────────────────┤│ ││ Without Gleaning: ││ ──────────────── ││ Document (100 entities) ──▶ Single Pass ──▶ 70 entities found ││ (30 missed) ││ ││ With Gleaning (2 iterations): ││ ───────────────────────────── ││ Document (100 entities) ──┬─▶ Pass 1 ──▶ 70 entities ││ │ ││ ├─▶ Pass 2 ──▶ +18 entities ││ │ "What did you miss?" ││ │ ││ └─▶ Pass 3 ──▶ +7 entities ││ "What else?" ││ ││ Total: 95 entities (95% recall) ││ │└─────────────────────────────────────────────────────────────────┘Why LLMs Miss Entities
Section titled “Why LLMs Miss Entities”LLMs miss entities due to several factors:
| Factor | Description | Example |
|---|---|---|
| Attention Limits | Long texts exceed attention span | Later paragraphs ignored |
| Implicit References | Pronouns, indirect mentions | ”the company” vs “Apple” |
| Context Overload | Many entities compete for attention | Dense technical docs |
| Entity Type Bias | Some types harder to recognize | Abstract concepts |
| Format Challenges | Tables, lists, code blocks | Structured data |
How Gleaning Works
Section titled “How Gleaning Works”The Algorithm
Section titled “The Algorithm”┌─────────────────────────────────────────────────────────────────┐│ GLEANING ALGORITHM │├─────────────────────────────────────────────────────────────────┤│ ││ FUNCTION glean(chunk, max_iterations): ││ ││ 1. all_entities = [] ││ 2. all_relationships = [] ││ ││ 3. // First pass: normal extraction ││ result = extract(chunk) ││ all_entities.extend(result.entities) ││ all_relationships.extend(result.relationships) ││ ││ 4. FOR i IN 1..max_iterations: ││ ││ 5. previous_names = all_entities.map(e => e.name) ││ ││ 6. // Gleaning prompt ││ prompt = """ ││ MANY entities were missed in the last extraction. ││ Already found: {previous_names} ││ Find ADDITIONAL entities and relationships. ││ """ ││ ││ 7. new_result = extract_with_prompt(chunk, prompt) ││ ││ 8. IF new_result.entities.is_empty(): ││ BREAK // No more entities to find ││ ││ 9. all_entities.extend(new_result.entities) ││ all_relationships.extend(new_result.relationships) ││ ││ 10. RETURN deduplicate(all_entities, all_relationships) ││ │└─────────────────────────────────────────────────────────────────┘Step-by-Step Example
Section titled “Step-by-Step Example”Input Text:
Dr. Sarah Chen, a researcher at MIT's Computer Science department,developed a novel approach to neural network optimization. Her work,published in Nature, builds on gradient descent methods pioneered byGeoffrey Hinton. The project received funding from the NSF and wasimplemented using TensorFlow.Pass 1 (Normal Extraction):
Entities Found: - SARAH_CHEN (PERSON) - MIT (ORGANIZATION) - NEURAL_NETWORK (CONCEPT)
Relationships: - SARAH_CHEN → works_at → MITPass 2 (Gleaning):
Prompt: “MANY entities were missed. Already found: SARAH_CHEN, MIT, NEURAL_NETWORK. Find ADDITIONAL entities.”
Additional Entities: - GRADIENT_DESCENT (METHOD) - GEOFFREY_HINTON (PERSON) - NATURE (PUBLICATION) - NSF (ORGANIZATION) - TENSORFLOW (TECHNOLOGY)
Additional Relationships: - SARAH_CHEN → published_in → NATURE - GRADIENT_DESCENT → pioneered_by → GEOFFREY_HINTON - PROJECT → funded_by → NSFFinal Result (Merged):
- 8 entities (vs 3 without gleaning)
- 4 relationships (vs 1 without gleaning)
Implementation
Section titled “Implementation”GleaningConfig
Section titled “GleaningConfig”/// Configuration for gleaning (re-extraction).#[derive(Debug, Clone, Serialize, Deserialize)]pub struct GleaningConfig { /// Maximum number of gleaning iterations. pub max_gleaning: usize,
/// Whether to continue extraction even if first pass finds entities. pub always_glean: bool,}
impl Default for GleaningConfig { fn default() -> Self { Self { max_gleaning: 1, // LightRAG default always_glean: false, } }}GleaningExtractor
Section titled “GleaningExtractor”/// A wrapper extractor that performs gleaning.pub struct GleaningExtractor { /// The underlying LLM provider. llm_provider: Arc<dyn LLMProvider>,
/// The base extractor to use. base_extractor: Arc<dyn EntityExtractor>,
/// Gleaning configuration. config: GleaningConfig,}
impl GleaningExtractor { /// Create a new gleaning extractor. pub fn new( llm_provider: Arc<dyn LLMProvider>, base_extractor: Arc<dyn EntityExtractor>, ) -> Self { Self { llm_provider, base_extractor, config: GleaningConfig::default(), } }
/// Set maximum gleaning iterations. pub fn with_max_gleaning(mut self, max: usize) -> Self { self.config.max_gleaning = max; self }}Gleaning Prompt
Section titled “Gleaning Prompt”fn build_gleaning_prompt(&self, text: &str, previous_entities: &[String]) -> String { let prev_entities_str = previous_entities.join(", ");
format!(r#"MANY entities and relationships were missed in the last extraction.Please identify any ADDITIONAL entities and relationships.
## Already Identified Entities{prev_entities_str}
## InstructionsLook for entities and relationships that were missed:- Implicit entities (mentioned indirectly)- Additional relationships between known entities- Contextual entities (dates, locations, concepts)
## Text to Re-Analyze{text}
## JSON Response "#)}Effectiveness Analysis
Section titled “Effectiveness Analysis”Recall by Iteration
Section titled “Recall by Iteration”| Iterations | Entities Found | Recall | Marginal Gain |
|---|---|---|---|
| 0 (single pass) | 70 | 70% | - |
| 1 | 88 | 88% | +18% |
| 2 | 95 | 95% | +7% |
| 3 | 97 | 97% | +2% |
| 4+ | 98 | 98% | <1% |
Key Insight: Diminishing returns after 2 iterations.
Cost Analysis
Section titled “Cost Analysis”| Iterations | LLM Calls | Cost Multiplier | Recall |
|---|---|---|---|
| 0 | 1x | 1.0x | 70% |
| 1 | 2x | 2.0x | 88% |
| 2 | 3x | 3.0x | 95% |
| 3 | 4x | 4.0x | 97% |
Recommendation: Use 1-2 iterations for best cost/recall tradeoff.
When to Use Gleaning
Section titled “When to Use Gleaning”Enable Gleaning For:
Section titled “Enable Gleaning For:”✅ High-stakes documents
- Legal contracts
- Medical records
- Research papers
- Financial reports
✅ Dense information
- Technical specifications
- Academic papers
- Multi-topic documents
✅ Quality over speed
- When recall matters more than latency
- When documents are ingested once, queried many times
Skip Gleaning For:
Section titled “Skip Gleaning For:”❌ Simple documents
- Short emails
- Basic notes
- Low-density text
❌ High-volume ingestion
- Real-time processing
- Large-scale batch jobs
- Cost-sensitive workloads
Configuration Options
Section titled “Configuration Options”Via API
Section titled “Via API”# Upload with gleaning enabledcurl -X POST http://localhost:8080/api/v1/documents/upload \ -H "X-Workspace-ID: default" \ -F "file=@document.pdf" \ -F "gleaning_iterations=2"Via Environment
Section titled “Via Environment”# Enable gleaning globallyexport EDGEQUAKE_GLEANING_ITERATIONS=2export EDGEQUAKE_ENABLE_GLEANING=trueVia Rust SDK
Section titled “Via Rust SDK”use edgequake_pipeline::{GleaningConfig, GleaningExtractor};
let config = GleaningConfig { max_gleaning: 2, always_glean: true,};
let extractor = GleaningExtractor::new(llm, base_extractor) .with_config(config);Integration with Pipeline
Section titled “Integration with Pipeline”┌─────────────────────────────────────────────────────────────────┐│ PIPELINE WITH GLEANING │├─────────────────────────────────────────────────────────────────┤│ ││ Document ││ │ ││ ▼ ││ ┌──────────┐ ││ │ Chunking │ ──▶ chunk_1, chunk_2, chunk_3, ... ││ └──────────┘ ││ │ ││ ▼ ││ ┌──────────────────────────────────────────────────────────┐ ││ │ FOR each chunk: │ ││ │ │ ││ │ ┌─────────────────┐ │ ││ │ │ GleaningExtractor│ │ ││ │ │ │ │ ││ │ │ Pass 1 (base) │──▶ entities_1 │ ││ │ │ Pass 2 (glean) │──▶ entities_2 │ ││ │ │ Pass 3 (glean) │──▶ entities_3 │ ││ │ │ │ │ ││ │ │ Merge & Dedupe │──▶ final_entities │ ││ │ └─────────────────┘ │ ││ │ │ ││ └──────────────────────────────────────────────────────────┘ ││ │ ││ ▼ ││ ┌──────────────┐ ││ │ Graph Storage │ ◀── All extracted entities & relationships ││ └──────────────┘ ││ │└─────────────────────────────────────────────────────────────────┘Performance Considerations
Section titled “Performance Considerations”Latency Impact
Section titled “Latency Impact”Document Processing Time:
Without Gleaning: Chunk 1: 500ms (1 LLM call) Chunk 2: 500ms Chunk 3: 500ms Total: 1.5s
With Gleaning (2 iterations): Chunk 1: 1.5s (3 LLM calls) Chunk 2: 1.5s Chunk 3: 1.5s Total: 4.5s (3x longer)Parallelization
Section titled “Parallelization”Gleaning passes within a chunk are sequential, but chunks can be processed in parallel:
// Process chunks in parallel, gleaning is per-chunklet results = futures::future::join_all( chunks.iter().map(|chunk| { let extractor = gleaning_extractor.clone(); async move { extractor.extract(chunk).await } })).await;Quality Metrics
Section titled “Quality Metrics”Track gleaning effectiveness:
pub struct GleaningStats { /// Entities found in base extraction pub base_entities: usize,
/// Additional entities found via gleaning pub gleaned_entities: usize,
/// Total gleaning iterations performed pub iterations: usize,
/// Time spent on gleaning (ms) pub gleaning_time_ms: u64,}
// Gleaning efficiency ratiolet efficiency = stats.gleaned_entities as f32 / stats.iterations as f32;Best Practices
Section titled “Best Practices”- Start with 1 iteration - Default setting balances cost and recall
- Increase for complex docs - Research papers, legal documents benefit from 2 iterations
- Monitor marginal gains - If gleaning finds <5% more entities, reduce iterations
- Cache results - Gleaning results are cached to avoid re-processing
- Use async processing - Don’t block on gleaning for real-time applications
Troubleshooting
Section titled “Troubleshooting”Gleaning Finds No New Entities
Section titled “Gleaning Finds No New Entities”Cause: Document is simple or first pass was comprehensive
Solution: Reduce max_gleaning for simple documents
Gleaning Takes Too Long
Section titled “Gleaning Takes Too Long”Cause: Too many iterations on large documents
Solution:
- Reduce
max_gleaning - Use smaller chunks (e.g., 800 tokens)
- Process in background
Duplicate Entities After Gleaning
Section titled “Duplicate Entities After Gleaning”Cause: Deduplication not merging variants
Solution: Check entity normalization settings
LightRAG Research Reference
Section titled “LightRAG Research Reference”Gleaning is based on research from the LightRAG paper (October 2024):
“Iterative re-extraction with previously-found entity context improves entity recall by 15-25% with diminishing returns after 2 iterations.”
Key findings:
- First gleaning pass: +18% entities on average
- Second gleaning pass: +7% entities on average
- Third+ passes: <2% additional entities
See Also
Section titled “See Also”- Entity Extraction - Base extraction process
- Entity Normalization - Deduplication after extraction
- Document Ingestion Tutorial - End-to-end guide
- Performance Tuning - Optimization strategies