Entity Normalization Deep-Dive
Deduplication and Merging for Clean Knowledge Graphs
Entity normalization is a critical step in building quality knowledge graphs. This guide explains how EdgeQuake transforms raw entity names into canonical forms and merges duplicate entities into unified nodes.
Table of Contents
- The Problem
- The Solution
- Normalization Algorithm
- Merge Strategy
- Configuration
- Edge Cases
- Quality Metrics
- Best Practices
The Problem
Without normalization, the same real-world entity appears as multiple disconnected nodes in the knowledge graph.
Graph Fragmentation Example
Consider a document mentioning “Sarah Chen” in different ways:
    ┌─────────────────────────────────────────────────────────────────┐
    │            WITHOUT NORMALIZATION (Fragmented Graph)             │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  ┌────────────────┐                                             │
    │  │  Sarah Chen    │  ← From chunk 1                             │
    │  └───────┬────────┘                                             │
    │          │ WORKS_AT                                             │
    │          ▼                                                      │
    │  ┌────────────────┐                                             │
    │  │      MIT       │                                             │
    │  └────────────────┘                                             │
    │                                                                 │
    │  ┌────────────────┐                                             │
    │  │  sarah chen    │  ← From chunk 2 (different node!)           │
    │  └───────┬────────┘                                             │
    │          │ AUTHORED                                             │
    │          ▼                                                      │
    │  ┌────────────────┐                                             │
    │  │ Climate Paper  │                                             │
    │  └────────────────┘                                             │
    │                                                                 │
    │  ┌────────────────┐                                             │
    │  │  Dr. S. Chen   │  ← From chunk 3 (yet another node!)         │
    │  └───────┬────────┘                                             │
    │          │ RESEARCHES                                           │
    │          ▼                                                      │
    │  ┌────────────────┐                                             │
    │  │Machine Learning│                                             │
    │  └────────────────┘                                             │
    │                                                                 │
    │  PROBLEM: 3 nodes for the same person!                          │
    │           Relationships are disconnected.                       │
    │           Query "Sarah Chen at MIT" misses paper authorship.    │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

Impact on Query Quality
| Issue | Without Normalization |
|---|---|
| Missing relationships | WORKS_AT and AUTHORED never connect |
| Incomplete answers | "What does Sarah Chen research?" misses ML |
| Inflated counts | 3 person nodes instead of 1 |
| Failed lookups | Search "Sarah Chen" doesn't find "sarah chen" |
The Solution
EdgeQuake normalizes all entity names to a canonical format before storage.
Unified Graph After Normalization
    ┌─────────────────────────────────────────────────────────────────┐
    │               WITH NORMALIZATION (Unified Graph)                │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │                 ┌────────────────┐                              │
    │       ┌──────── │   SARAH_CHEN   │ ───────────┐                 │
    │       │         └───────┬────────┘            │                 │
    │       │ WORKS_AT        │ AUTHORED            │ RESEARCHES      │
    │       ▼                 ▼                     ▼                 │
    │  ┌──────────┐    ┌──────────────┐    ┌─────────────────┐        │
    │  │   MIT    │    │CLIMATE_PAPER │    │MACHINE_LEARNING │        │
    │  └──────────┘    └──────────────┘    └─────────────────┘        │
    │                                                                 │
    │  RESULT: Single node with all relationships!                    │
    │          "Sarah Chen at MIT" now finds paper AND ML research    │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

Normalization Transform
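| Raw Input | Normalized Output |
|---|---|
| "Sarah Chen" | SARAH_CHEN |
| "sarah chen" | SARAH_CHEN |
| "Dr. S. Chen" | DR._S._CHEN |
| "The Company" | COMPANY |
| "John’s Research" | JOHN_RESEARCH |

Normalization unifies case, whitespace, leading-article, and possessive variants. It does not expand abbreviations or drop titles: "Dr. S. Chen" becomes DR._S._CHEN, which remains a separate node from SARAH_CHEN unless an alias mapping (see Best Practices) links the two.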
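The mappings in the table above can be checked with a small self-contained sketch. It follows the documented steps (trim, strip one leading article, drop possessives, title-case, join with underscores, uppercase). `to_title_case` here is a hypothetical stand-in for EdgeQuake's helper, and the handling of the typographic apostrophe (U+2019) is an assumption rather than confirmed behavior.

```rust
/// Hypothetical helper: first character uppercase, the rest lowercase.
fn to_title_case(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first
            .to_uppercase()
            .chain(chars.flat_map(|c| c.to_lowercase()))
            .collect(),
        None => String::new(),
    }
}

/// Sketch of the normalization pipeline described in this guide.
pub fn normalize_entity_name(raw_name: &str) -> String {
    let trimmed = raw_name.trim();

    // Strip at most one leading article (case-sensitive, as in the listing).
    let without_prefix = ["The ", "the ", "A ", "An "]
        .iter()
        .find_map(|prefix| trimmed.strip_prefix(*prefix))
        .unwrap_or(trimmed);

    without_prefix
        .split_whitespace()
        .map(|word| {
            // Assumption: both straight and typographic possessives are removed.
            let base = word
                .strip_suffix("'s")
                .or_else(|| word.strip_suffix("\u{2019}s"))
                .unwrap_or(word);
            to_title_case(base)
        })
        .collect::<Vec<_>>()
        .join("_")
        .to_uppercase()
}
```

Because the final step uppercases everything, punctuation inside a word (periods, hyphens, ampersands) passes through unchanged, which is why "Dr. S. Chen" keeps its periods.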
Normalization Algorithm
The normalize_entity_name() function applies these transformations in order:
    ┌─────────────────────────────────────────────────────────────────┐
    │                     NORMALIZATION PIPELINE                      │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Input: " The John Doe's Company "                              │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 1: TRIM WHITESPACE             │                     │
    │     │ " The John Doe's Company "          │                     │
    │     │ → "The John Doe's Company"          │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 2: REMOVE PREFIXES             │                     │
    │     │ Removes: "The ", "A ", "An "        │                     │
    │     │ "The John Doe's Company"            │                     │
    │     │ → "John Doe's Company"              │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 3: SPLIT BY WHITESPACE         │                     │
    │     │ "John Doe's Company"                │                     │
    │     │ → ["John", "Doe's", "Company"]      │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 4: REMOVE POSSESSIVES          │                     │
    │     │ Each word: strip "'s" suffix        │                     │
    │     │ ["John", "Doe's", "Company"]        │                     │
    │     │ → ["John", "Doe", "Company"]        │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 5: TITLE CASE EACH WORD        │                     │
    │     │ First letter upper, rest lower      │                     │
    │     │ ["John", "Doe", "Company"]          │                     │
    │     │ → ["John", "Doe", "Company"]        │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 6: JOIN WITH UNDERSCORES       │                     │
    │     │ ["John", "Doe", "Company"]          │                     │
    │     │ → "John_Doe_Company"                │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Step 7: UPPERCASE                   │                     │
    │     │ "John_Doe_Company"                  │                     │
    │     │ → "JOHN_DOE_COMPANY"                │                     │
    │     └─────────────────────────────────────┘                     │
    │                       │                                         │
    │                       ▼                                         │
    │  Output: "JOHN_DOE_COMPANY"                                     │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

Rust Implementation
    pub fn normalize_entity_name(raw_name: &str) -> String {
        let trimmed = raw_name.trim();

        // Remove one common leading prefix (the match is case-sensitive)
        let without_prefix = trimmed
            .strip_prefix("The ")
            .or_else(|| trimmed.strip_prefix("the "))
            .or_else(|| trimmed.strip_prefix("A "))
            .or_else(|| trimmed.strip_prefix("An "))
            .unwrap_or(trimmed);

        // Split, normalize each word, rejoin
        without_prefix
            .split_whitespace()
            .filter(|w| !w.is_empty())
            .map(|word| {
                // Strip a possessive suffix: straight apostrophe first,
                // then the typographic one (U+2019)
                let without_possessive = word
                    .strip_suffix("'s")
                    .or_else(|| word.strip_suffix("\u{2019}s"))
                    .unwrap_or(word);
                // to_title_case: first letter upper, rest lower (step 5)
                to_title_case(without_possessive)
            })
            .collect::<Vec<_>>()
            .join("_")
            .to_uppercase()
    }

Merge Strategy
When the same entity appears in multiple documents, EdgeQuake merges them intelligently.
Merge Decision Flow
    ┌─────────────────────────────────────────────────────────────────┐
    │                        ENTITY MERGE FLOW                        │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  New Entity: "SARAH_CHEN"                                       │
    │  Description: "A climate scientist at MIT"                      │
    │                       │                                         │
    │                       ▼                                         │
    │     ┌─────────────────────────────────────┐                     │
    │     │ Query: Does SARAH_CHEN exist?       │                     │
    │     └─────────────────┬───────────────────┘                     │
    │                       │                                         │
    │            ┌──────────┴──────────┐                              │
    │            │                     │                              │
    │            NO                   YES                             │
    │            │                     │                              │
    │            ▼                     ▼                              │
    │     ┌─────────────┐   ┌─────────────────────────────┐           │
    │     │ CREATE      │   │ MERGE                       │           │
    │     │ new node    │   │                             │           │
    │     │             │   │ 1. Combine descriptions     │           │
    │     │ properties: │   │ 2. Max(importance)          │           │
    │     │ - desc      │   │ 3. Append source_ids        │           │
    │     │ - type      │   │ 4. Update timestamp         │           │
    │     │ - source    │   │                             │           │
    │     └─────────────┘   └─────────────────────────────┘           │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

Description Merging Strategies
EdgeQuake supports two strategies for combining entity descriptions:
1. LLM-Based Summarization (Default)
When the same entity is described differently in multiple chunks, the LLM synthesizes a unified description:
    ┌─────────────────────────────────────────────────────────────────┐
    │                      LLM DESCRIPTION MERGE                      │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Existing: "Sarah Chen is a professor at MIT"                   │
    │  New:      "Dr. Chen researches climate modeling"               │
    │                       │                                         │
    │                       ▼                                         │
    │              ┌─────────────────┐                                │
    │              │ LLM Summarizer  │                                │
    │              │                 │                                │
    │              │ Prompt:         │                                │
    │              │ "Merge these    │                                │
    │              │  descriptions   │                                │
    │              │  for SARAH_CHEN │                                │
    │              │  into a single  │                                │
    │              │  coherent       │                                │
    │              │  description"   │                                │
    │              └────────┬────────┘                                │
    │                       │                                         │
    │                       ▼                                         │
    │  Merged: "Sarah Chen is a professor at MIT who researches       │
    │           climate modeling"                                     │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

2. Simple Concatenation (Fallback)
If the LLM is unavailable or fails, descriptions are concatenated with deduplication:
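    fn merge_descriptions(old: &str, new: &str, max_len: usize) -> String {
        if old.contains(new) {
            return old.to_string(); // Avoid duplicates
        }

        let merged = format!("{} {}", old, new);
        if merged.len() > max_len {
            // max_len is a byte budget; back up to a UTF-8 character
            // boundary so the slice cannot panic on multi-byte text.
            let mut end = max_len;
            while !merged.is_char_boundary(end) {
                end -= 1;
            }
            merged[..end].to_string()
        } else {
            merged
        }
    }

Source Lineage Tracking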
EdgeQuake maintains provenance for all entity occurrences:
    {
      "id": "SARAH_CHEN",
      "entity_type": "PERSON",
      "description": "Professor at MIT researching climate modeling",
      "source_ids": "chunk_001|chunk_042|chunk_089",
      "source_document_ids": ["doc_1", "doc_2"],
      "importance": 0.85,
      "first_seen": "2024-01-15T10:30:00Z",
      "last_updated": "2024-01-15T11:45:00Z"
    }

This enables:
- Citation tracking: Link answers back to source documents
- Cascade delete: Remove an entity when its source documents are deleted
- Confidence scoring: More sources = higher confidence
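The create-or-merge flow and the lineage record can be tied together in a short sketch. Everything here is illustrative: the in-memory `HashMap` store, the `Entity` and `Mention` types, and the `upsert_entity` and `append_source` functions are hypothetical names, not EdgeQuake's API. It uses the concatenation fallback for descriptions and stores `source_ids` as a pipe-delimited string, as in the record above.

```rust
use std::collections::HashMap;

/// Stored node with a subset of the lineage fields (hypothetical type).
#[derive(Debug, Clone)]
pub struct Entity {
    pub description: String,
    pub source_ids: String, // pipe-delimited chunk IDs
    pub importance: f32,
    pub first_seen: String,
    pub last_updated: String,
}

/// One extracted occurrence of an already-normalized entity (hypothetical type).
pub struct Mention<'a> {
    pub id: &'a str,
    pub description: &'a str,
    pub chunk_id: &'a str,
    pub importance: f32,
}

/// Append `chunk_id` unless it is already tracked or the cap is reached.
fn append_source(source_ids: &mut String, chunk_id: &str, max_sources: usize) {
    let mut count = 0;
    for id in source_ids.split('|').filter(|s| !s.is_empty()) {
        if id == chunk_id {
            return; // already tracked
        }
        count += 1;
    }
    if count >= max_sources {
        return; // cap reached, drop the new ID
    }
    if !source_ids.is_empty() {
        source_ids.push('|');
    }
    source_ids.push_str(chunk_id);
}

/// Create the node if absent; otherwise apply the four merge steps.
pub fn upsert_entity(
    graph: &mut HashMap<String, Entity>,
    m: &Mention<'_>,
    now: &str,
    max_sources: usize,
) {
    if let Some(existing) = graph.get_mut(m.id) {
        // 1. Combine descriptions (concatenation fallback, skipping duplicates)
        if !existing.description.contains(m.description) {
            existing.description.push(' ');
            existing.description.push_str(m.description);
        }
        // 2. Max(importance)
        existing.importance = existing.importance.max(m.importance);
        // 3. Append source_ids
        append_source(&mut existing.source_ids, m.chunk_id, max_sources);
        // 4. Update timestamp
        existing.last_updated = now.to_string();
        return;
    }
    graph.insert(
        m.id.to_string(),
        Entity {
            description: m.description.to_string(),
            source_ids: m.chunk_id.to_string(),
            importance: m.importance,
            first_seen: now.to_string(),
            last_updated: now.to_string(),
        },
    );
}
```

Passing the timestamp in as a string keeps the sketch free of external crates; a real store would likely use a proper time type.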
Configuration
The MergerConfig struct controls merging behavior:
    pub struct MergerConfig {
        pub max_description_length: usize, // Default: 4096
        pub description_decay: f32,        // Default: 0.9
        pub min_importance: f32,           // Default: 0.1
        pub max_sources: usize,            // Default: 10
        pub use_llm_summarization: bool,   // Default: true
    }

Parameter Guide
| Parameter | Default | Description | Tuning Recommendation |
|---|---|---|---|
| max_description_length | 4096 | Max chars in merged description | Increase for detailed entities |
| description_decay | 0.9 | Weight decay for older descriptions | Lower = newer descriptions preferred |
| min_importance | 0.1 | Entities below this are pruned | Raise to reduce noise |
| max_sources | 10 | Max source_ids tracked per entity | Increase for better lineage |
| use_llm_summarization | true | Use LLM for description merging | Disable for faster, cheaper merging |
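One way to keep the documented defaults in a single place is a `Default` implementation, sketched below. Whether EdgeQuake's own `MergerConfig` implements `Default` is an assumption, and `fast_merge_config` is a hypothetical preset that only illustrates the "disable for faster, cheaper merging" recommendation.

```rust
/// Mirror of the documented struct, redeclared so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct MergerConfig {
    pub max_description_length: usize,
    pub description_decay: f32,
    pub min_importance: f32,
    pub max_sources: usize,
    pub use_llm_summarization: bool,
}

impl Default for MergerConfig {
    /// Defaults taken from the parameter table above.
    fn default() -> Self {
        Self {
            max_description_length: 4096,
            description_decay: 0.9,
            min_importance: 0.1,
            max_sources: 10,
            use_llm_summarization: true,
        }
    }
}

/// Hypothetical preset: skip the LLM and rely on concatenation only.
pub fn fast_merge_config() -> MergerConfig {
    MergerConfig {
        use_llm_summarization: false,
        ..MergerConfig::default()
    }
}
```

Struct update syntax (`..MergerConfig::default()`) lets a caller override one field while inheriting the rest.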
Edge Cases
Special Characters
Some characters are preserved to maintain meaning:
| Input | Output | Note |
|---|---|---|
| "New-York" | NEW-YORK | Hyphens preserved |
| "C++" | C++ | Programming language syntax |
| "O’Brien" | O’BRIEN | Irish names |
| "AT&T" | AT&T | Ampersand preserved |
Acronyms
Acronyms normalize to uppercase naturally:
| Input | Output |
|---|---|
| "MIT" | MIT |
| "N.A.S.A." | N.A.S.A. |
| "NATO" | NATO |
Empty or Invalid
    normalize_entity_name("")     // → ""
    normalize_entity_name(" ")    // → ""
    normalize_entity_name("The")  // → "THE" (single word kept)
    normalize_entity_name("A")    // → "A"

Quality Metrics
Section titled “Quality Metrics”Deduplication Rates
From production benchmarks:
| Scenario | Raw Entities | After Normalization | Dedup Rate |
|---|---|---|---|
| Scientific papers | 50 | 32 | 36% |
| News articles | 80 | 48 | 40% |
| Legal documents | 120 | 85 | 29% |
| Mixed corpus | 200 | 128 | 36% |
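The "Dedup Rate" column is consistent with (raw − normalized) / raw, expressed as a percentage. A minimal sketch of that calculation follows; the function name `dedup_rate` is hypothetical, and returning `None` for an empty or inconsistent count is a design choice made for this example.

```rust
/// Percentage of raw entities removed by normalization.
/// Returns `None` when there are no raw entities or the counts are inconsistent.
pub fn dedup_rate(raw: usize, normalized: usize) -> Option<f64> {
    if raw == 0 || normalized > raw {
        return None;
    }
    Some(100.0 * (raw - normalized) as f64 / raw as f64)
}
```

For the legal-documents row, 120 raw and 85 normalized entities give roughly 29.2%, which rounds to the 29% shown in the table.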
Quality Impact
    ┌─────────────────────────────────────────────────────────────────┐
    │                       QUALITY IMPROVEMENT                       │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Query: "What did Sarah Chen publish?"                          │
    │                                                                 │
    │  Without Normalization:                                         │
    │  ─────────────────────                                          │
    │  Found: 1 paper (from "Sarah Chen" node only)                   │
    │  Missed: 2 papers (from "sarah chen" and "S. Chen" nodes)       │
    │  Recall: 33%                                                    │
    │                                                                 │
    │  With Normalization:                                            │
    │  ──────────────────                                             │
    │  Found: 3 papers (all linked to SARAH_CHEN)                     │
    │  Missed: 0                                                      │
    │  Recall: 100%                                                   │
    │                                                                 │
    │  IMPROVEMENT: 3x better recall                                  │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

Best Practices
1. Pre-extraction Cleaning
Clean input text before extraction to improve entity quality:
    // Before extraction
    let text = text
        .replace("Dr. ", "")    // Remove titles
        .replace("Prof. ", "")  // Remove titles
        .replace("Mr. ", "")    // Remove honorifics
        .replace("Mrs. ", "");

2. Entity Type Consistency
Use consistent entity types across documents:
| Good | Bad |
|---|---|
| PERSON | Person, person, HUMAN |
| ORGANIZATION | Org, Company, COMPANY |
| LOCATION | Place, Location, GEO |
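If upstream extractors cannot be made to agree on type labels, the variants in the table can be folded into the canonical set before storage. The sketch below is one hypothetical way to do that; `canonical_entity_type` is not an EdgeQuake function, and the variant list covers only the examples shown above.

```rust
/// Map a free-form type label to one of the canonical types from the table.
/// Returns `None` for labels this sketch does not recognize.
pub fn canonical_entity_type(raw: &str) -> Option<&'static str> {
    match raw.trim().to_uppercase().as_str() {
        "PERSON" | "HUMAN" => Some("PERSON"),
        "ORGANIZATION" | "ORG" | "COMPANY" => Some("ORGANIZATION"),
        "LOCATION" | "PLACE" | "GEO" => Some("LOCATION"),
        _ => None,
    }
}
```

Returning `None` instead of guessing makes unknown labels visible, so they can be reviewed rather than silently stored under the wrong type.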
3. Alias Mapping
For known aliases, consider pre-normalization mapping:
    fn apply_aliases(name: &str) -> String {
        // Every arm must yield a String, so the literal targets are converted.
        match name.trim().to_uppercase().as_str() {
            "USA" | "US" | "AMERICA" => "UNITED_STATES".to_string(),
            "NYC" | "NEW YORK CITY" => "NEW_YORK".to_string(),
            _ => normalize_entity_name(name),
        }
    }

4. Monitor Deduplication Rates
Track these metrics in production:
- Dedup rate: Should be 25-50% for typical corpora
- False merges: Manually review sample for incorrect merges
- Description quality: Check that merged descriptions are coherent
See Also
- LightRAG Algorithm - The full extraction pipeline
- Entity Extraction - How entities are identified
- Knowledge Graph - Graph structure
- Query Modes - How normalized entities are queried