Entity Extraction

Entity extraction uses LLMs to identify people, organizations, concepts, and the relationships between them in unstructured text.


Entity extraction is the process of:

  1. Identifying meaningful entities in text (people, places, concepts)
  2. Classifying them by type (PERSON, ORGANIZATION, CONCEPT)
  3. Describing their attributes based on context
  4. Connecting them through explicit relationships

EdgeQuake uses LLMs as “knowledge engineers” — transforming unstructured text into structured knowledge.


Traditional NLP relied on rule-based extractors or trained NER models. EdgeQuake uses LLMs instead, with these trade-offs:

| Approach | Pros | Cons |
|----------|------|------|
| Rules | Fast, predictable | Brittle, domain-specific |
| Trained NER | Accurate for known types | Requires training data |
| LLM-based | Domain-agnostic, rich descriptions | Slower, requires API |

LLM extraction provides:

  • Zero-shot extraction: Works on any domain without training
  • Rich descriptions: Not just labels, but context
  • Relationship inference: Understands connections, not just entities

EdgeQuake’s default entity types:

┌─────────────────────────────────────────────────────────────────┐
│ ENTITY TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PERSON │ People, characters, individuals │
│ ORGANIZATION │ Companies, institutions, teams │
│ LOCATION │ Places, regions, coordinates │
│ EVENT │ Occurrences, meetings, milestones │
│ CONCEPT │ Ideas, theories, methods │
│ TECHNOLOGY │ Tools, systems, platforms │
│ PRODUCT │ Items, services, offerings │
│ OTHER │ Fallback for uncategorized │
│ │
└─────────────────────────────────────────────────────────────────┘

Custom types: You can configure domain-specific types like PROTEIN, DISEASE, or LEGAL_TERM.
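
As a sketch, mapping a raw type label from the LLM onto this list might look like the following. The enum and function names are illustrative, not EdgeQuake's actual API:

```rust
// Illustrative sketch of the default entity types; names are assumptions.
#[derive(Debug, PartialEq)]
enum EntityType {
    Person,
    Organization,
    Location,
    Event,
    Concept,
    Technology,
    Product,
    Other, // fallback for uncategorized entities
}

// Map a type label emitted by the LLM onto a known variant,
// falling back to Other for anything unrecognized.
fn parse_entity_type(label: &str) -> EntityType {
    match label.to_ascii_uppercase().as_str() {
        "PERSON" => EntityType::Person,
        "ORGANIZATION" | "ORG" => EntityType::Organization,
        "LOCATION" => EntityType::Location,
        "EVENT" => EntityType::Event,
        "CONCEPT" => EntityType::Concept,
        "TECHNOLOGY" => EntityType::Technology,
        "PRODUCT" => EntityType::Product,
        // domain-specific types like PROTEIN would be matched here
        _ => EntityType::Other,
    }
}
```

The `Other` fallback keeps extraction lossless: an unexpected label never drops an entity.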


┌─────────────────────────────────────────────────────────────────┐
│ EXTRACTION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ │
│ │ TEXT │ "Dr. Sarah Chen leads the AI team at Quantum │
│ │ CHUNK │ Dynamics Lab. Her research on neural networks │
│ │ │ has been cited 500 times." │
│ └────┬─────┘ │
│ │ │
│ v │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ LLM EXTRACTION │ │
│ │ System: "You are a Knowledge Graph Specialist..." │ │
│ │ User: "Extract entities and relationships from..." │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ RAW OUTPUT (Tuples) │ │
│ │ entity<|#|>SARAH_CHEN<|#|>PERSON<|#|>Lead researcher... │ │
│ │ entity<|#|>QUANTUM_LAB<|#|>ORG<|#|>Research institution │ │
│ │ entity<|#|>NEURAL_NETWORKS<|#|>CONCEPT<|#|>ML approach │ │
│ │ relation<|#|>SARAH_CHEN<|#|>QUANTUM_LAB<|#|>works_at... │ │
│ │ <|COMPLETE|> │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ PARSED RESULT │ │
│ │ Entities: 3 │ │
│ │ Relationships: 3 │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Relationships connect entities with typed edges:

┌───────────────┐ WORKS_AT ┌───────────────┐
│ SARAH_CHEN │────────────────────────▶│ QUANTUM_LAB │
│ (PERSON) │ │ (ORGANIZATION)│
└───────────────┘ └───────────────┘
│ RESEARCHES
v
┌───────────────┐
│NEURAL_NETWORKS│
│ (CONCEPT) │
└───────────────┘

Each relationship includes:

  • Source entity: Starting node
  • Target entity: Ending node
  • Type/Keywords: Relationship category
  • Description: Context from the text
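
These fields can be sketched as a plain struct; the field names here are assumptions, not EdgeQuake's actual types:

```rust
// Illustrative sketch of a typed edge between two entities.
struct Relationship {
    source: String,      // starting node, e.g. "SARAH_CHEN"
    target: String,      // ending node, e.g. "QUANTUM_LAB"
    keywords: String,    // relationship category, e.g. "works_at"
    description: String, // context taken from the source text
}
```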

Before storing, entity names are normalized to prevent duplicates:

| Raw Input | Normalized Output |
|-----------|-------------------|
| "John Doe" | JOHN_DOE |
| "john doe" | JOHN_DOE |
| "the Company" | COMPANY |
| "John's team" | JOHN_TEAM |

Why normalize?

  • Prevents “John Doe” and “john doe” from becoming separate nodes
  • Enables entity merging across documents
  • Improves query accuracy

See normalizer.rs for the implementation.
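
A minimal sketch that reproduces the table above (the real rules in normalizer.rs may differ in detail):

```rust
// Sketch: uppercase, drop possessives and leading articles,
// then join words with underscores.
fn normalize_entity_name(raw: &str) -> String {
    let mut s = raw.trim().to_ascii_uppercase();
    s = s.replace("'S", ""); // drop possessives: "JOHN'S" -> "JOHN"
    let words: Vec<&str> = s
        .split_whitespace()
        .filter(|w| !matches!(*w, "THE" | "A" | "AN")) // strip articles
        .collect();
    words.join("_")
}
```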


EdgeQuake uses a tuple-delimited format by default:

entity<|#|>NAME<|#|>TYPE<|#|>DESCRIPTION

Why not JSON?

| Aspect | Tuple Format | JSON Format |
|--------|--------------|-------------|
| Streaming | ✅ Line-by-line | ❌ Needs complete structure |
| Partial recovery | ✅ Parse valid lines | ❌ All or nothing |
| LLM reliability | ✅ Fewer errors | ❌ Escaping issues |
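
A sketch of a line-by-line parser for this format shows why partial recovery works: each line stands alone, so a malformed line is skipped rather than invalidating the whole response. Type and field names here are illustrative:

```rust
// One parsed record from the tuple-delimited output.
#[derive(Debug)]
enum Record {
    Entity { name: String, kind: String, description: String },
    Relation { source: String, target: String, description: String },
}

// Parse a single line; returns None for markers like <|COMPLETE|>
// or malformed lines, which are simply skipped.
fn parse_line(line: &str) -> Option<Record> {
    let parts: Vec<&str> = line.trim().split("<|#|>").collect();
    match parts.as_slice() {
        ["entity", name, kind, desc] => Some(Record::Entity {
            name: name.to_string(),
            kind: kind.to_string(),
            description: desc.to_string(),
        }),
        ["relation", src, tgt, desc] => Some(Record::Relation {
            source: src.to_string(),
            target: tgt.to_string(),
            description: desc.to_string(),
        }),
        _ => None,
    }
}
```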

LLMs sometimes miss entities. Gleaning performs a second pass:

Pass 1: "Extract entities from this text..."
Result: SARAH_CHEN, QUANTUM_LAB
Pass 2: "What entities did you miss? Already found: SARAH_CHEN, QUANTUM_LAB"
Result: NEURAL_NETWORKS, AI_RESEARCH (missed in first pass)

Research shows 1-2 gleaning iterations improve recall by 15-25%.
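
A hypothetical gleaning loop might look like the following; the prompt wording and the `llm` callback are stand-ins, not EdgeQuake's actual implementation:

```rust
// Sketch: run an initial extraction pass, then re-prompt with the
// entities found so far and merge in anything new.
fn glean(text: &str, passes: usize, llm: impl Fn(&str) -> Vec<String>) -> Vec<String> {
    let mut found: Vec<String> = Vec::new();
    for pass in 0..passes {
        let prompt = if pass == 0 {
            format!("Extract entities from this text: {text}")
        } else {
            format!(
                "What entities did you miss? Already found: {}. Text: {text}",
                found.join(", ")
            )
        };
        for entity in llm(&prompt) {
            if !found.contains(&entity) {
                found.push(entity); // deduplicate across passes
            }
        }
    }
    found
}
```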