Skip to content

Entity Normalization Deep-Dive

Deduplication and Merging for Clean Knowledge Graphs

Entity normalization is a critical step in building quality knowledge graphs. This guide explains how EdgeQuake transforms raw entity names into canonical forms and merges duplicate entities into unified nodes.



Without normalization, the same real-world entity appears as multiple disconnected nodes in the knowledge graph.

Consider a document mentioning “Sarah Chen” in different ways:

┌─────────────────────────────────────────────────────────────────┐
│ WITHOUT NORMALIZATION (Fragmented Graph) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ │
│ │ Sarah Chen │ ← From chunk 1 │
│ └───────┬────────┘ │
│ │ WORKS_AT │
│ ▼ │
│ ┌────────────────┐ │
│ │ MIT │ │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ sarah chen │ ← From chunk 2 (different node!) │
│ └───────┬────────┘ │
│ │ AUTHORED │
│ ▼ │
│ ┌────────────────┐ │
│ │ Climate Paper │ │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ Dr. S. Chen │ ← From chunk 3 (yet another node!) │
│ └───────┬────────┘ │
│ │ RESEARCHES │
│ ▼ │
│ ┌────────────────┐ │
│ │ Machine Learning │ │
│ └────────────────┘ │
│ │
│ PROBLEM: 3 nodes for the same person! │
│ Relationships are disconnected. │
│ Query "Sarah Chen at MIT" misses paper authorship. │
│ │
└─────────────────────────────────────────────────────────────────┘
IssueWithout Normalization
Missing relationshipsWORKS_AT and AUTHORED never connect
Incomplete answers”What does Sarah Chen research?” misses ML
Inflated counts3 person nodes instead of 1
Failed lookupsSearch “Sarah Chen” doesn’t find “sarah chen”

EdgeQuake normalizes all entity names to a canonical format before storage.

┌─────────────────────────────────────────────────────────────────┐
│ WITH NORMALIZATION (Unified Graph) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ │
│ ┌─────── │ SARAH_CHEN │ ───────┐ │
│ │ └───────┬────────┘ │ │
│ │ WORKS_AT │ AUTHORED │ RESEARCHES │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ MIT │ │CLIMATE_PAPER │ │MACHINE_LEARNING │ │
│ └──────────┘ └──────────────┘ └─────────────────┘ │
│ │
│ RESULT: Single node with all relationships! │
│ "Sarah Chen at MIT" now finds paper AND ML research │
│ │
└─────────────────────────────────────────────────────────────────┘
Raw InputNormalized Output
”Sarah Chen”SARAH_CHEN
”sarah chen”SARAH_CHEN
”Dr. S. Chen”DR._S._CHEN
”The Company”COMPANY
”John’s Research”JOHN_RESEARCH

The normalize_entity_name() function applies these transformations in order:

┌─────────────────────────────────────────────────────────────────┐
│ NORMALIZATION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: " The John Doe's Company " │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 1: TRIM WHITESPACE │ │
│ │ " The John Doe's Company " │ │
│ │ → "The John Doe's Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 2: REMOVE PREFIXES │ │
│ │ Removes: "The ", "A ", "An " │ │
│ │ "The John Doe's Company" │ │
│ │ → "John Doe's Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 3: SPLIT BY WHITESPACE │ │
│ │ "John Doe's Company" │ │
│ │ → ["John", "Doe's", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 4: REMOVE POSSESSIVES │ │
│ │ Each word: strip "'s" suffix │ │
│ │ ["John", "Doe's", "Company"] │ │
│ │ → ["John", "Doe", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 5: TITLE CASE EACH WORD │ │
│ │ First letter upper, rest lower │ │
│ │ ["John", "Doe", "Company"] │ │
│ │ → ["John", "Doe", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 6: JOIN WITH UNDERSCORES │ │
│ │ ["John", "Doe", "Company"] │ │
│ │ → "John_Doe_Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 7: UPPERCASE │ │
│ │ "John_Doe_Company" │ │
│ │ → "JOHN_DOE_COMPANY" │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output: "JOHN_DOE_COMPANY" │
│ │
└─────────────────────────────────────────────────────────────────┘
pub fn normalize_entity_name(raw_name: &str) -> String {
let trimmed = raw_name.trim();
// Remove common prefixes
let without_prefix = trimmed
.strip_prefix("The ")
.or_else(|| trimmed.strip_prefix("the "))
.or_else(|| trimmed.strip_prefix("A "))
.or_else(|| trimmed.strip_prefix("An "))
.unwrap_or(trimmed);
// Split, normalize each word, rejoin
without_prefix
.split_whitespace()
.filter(|w| !w.is_empty())
.map(|word| {
let without_possessive = word
.strip_suffix("'s")
.or_else(|| word.strip_suffix("'s"))
.unwrap_or(word);
to_title_case(without_possessive)
})
.collect::<Vec<_>>()
.join("_")
.to_uppercase()
}

When the same entity appears in multiple documents, EdgeQuake merges them intelligently.

┌─────────────────────────────────────────────────────────────────┐
│ ENTITY MERGE FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ New Entity: "SARAH_CHEN" │
│ Description: "A climate scientist at MIT" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Query: Does SARAH_CHEN exist? │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ NO YES │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────────────────┐ │
│ │ CREATE │ │ MERGE │ │
│ │ new node │ │ │ │
│ │ │ │ 1. Combine descriptions │ │
│ │ properties: │ │ 2. Max(importance) │ │
│ │ - desc │ │ 3. Append source_ids │ │
│ │ - type │ │ 4. Update timestamp │ │
│ │ - source │ │ │ │
│ └─────────────┘ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

EdgeQuake supports two strategies for combining entity descriptions:

When the same entity is described differently in multiple chunks, the LLM synthesizes a unified description:

┌─────────────────────────────────────────────────────────────────┐
│ LLM DESCRIPTION MERGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Existing: "Sarah Chen is a professor at MIT" │
│ New: "Dr. Chen researches climate modeling" │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ LLM Summarizer │ │
│ │ │ │
│ │ Prompt: │ │
│ │ "Merge these │ │
│ │ descriptions │ │
│ │ for SARAH_CHEN│ │
│ │ into a single │ │
│ │ coherent │ │
│ │ description" │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ Merged: "Sarah Chen is a professor at MIT who researches │
│ climate modeling" │
│ │
└─────────────────────────────────────────────────────────────────┘

If LLM is unavailable or fails, descriptions are concatenated with deduplication:

fn merge_descriptions(old: &str, new: &str, max_len: usize) -> String {
if old.contains(new) {
return old.to_string(); // Avoid duplicates
}
let merged = format!("{} {}", old, new);
if merged.len() > max_len {
merged[..max_len].to_string()
} else {
merged
}
}

EdgeQuake maintains provenance for all entity occurrences:

{
"id": "SARAH_CHEN",
"entity_type": "PERSON",
"description": "Professor at MIT researching climate modeling",
"source_ids": "chunk_001|chunk_042|chunk_089",
"source_document_ids": ["doc_1", "doc_2"],
"importance": 0.85,
"first_seen": "2024-01-15T10:30:00Z",
"last_updated": "2024-01-15T11:45:00Z"
}

This enables:

  • Citation tracking: Link answers back to source documents
  • Cascade delete: Remove entity when source documents deleted
  • Confidence scoring: More sources = higher confidence

The MergerConfig struct controls merging behavior:

pub struct MergerConfig {
pub max_description_length: usize, // Default: 4096
pub description_decay: f32, // Default: 0.9
pub min_importance: f32, // Default: 0.1
pub max_sources: usize, // Default: 10
pub use_llm_summarization: bool, // Default: true
}
ParameterDefaultDescriptionTuning Recommendation
max_description_length4096Max chars in merged descriptionIncrease for detailed entities
description_decay0.9Weight decay for older descriptionsLower = newer descriptions preferred
min_importance0.1Entities below this are prunedRaise to reduce noise
max_sources10Max source_ids tracked per entityIncrease for better lineage
use_llm_summarizationtrueUse LLM for description mergingDisable for faster, cheaper merging

Some characters are preserved to maintain meaning:

InputOutputNote
”New-York”NEW-YORKHyphens preserved
”C++“C++Programming language syntax
”O’Brien”O’BRIENIrish names
”AT&T”AT&TAmpersand preserved

Acronyms normalize to uppercase naturally:

InputOutput
”MIT”MIT
”N.A.S.A.”N.A.S.A.
”NATO”NATO
normalize_entity_name("") // → ""
normalize_entity_name(" ") // → ""
normalize_entity_name("The") // → "THE" (single word kept)
normalize_entity_name("A") // → "A"

From production benchmarks:

ScenarioRaw EntitiesAfter NormalizationDedup Rate
Scientific papers503236%
News articles804840%
Legal documents1208529%
Mixed corpus20012836%
┌─────────────────────────────────────────────────────────────────┐
│ QUALITY IMPROVEMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Query: "What did Sarah Chen publish?" │
│ │
│ Without Normalization: │
│ ───────────────────── │
│ Found: 1 paper (from "Sarah Chen" node only) │
│ Missed: 2 papers (from "sarah chen" and "S. Chen" nodes) │
│ Recall: 33% │
│ │
│ With Normalization: │
│ ────────────────── │
│ Found: 3 papers (all linked to SARAH_CHEN) │
│ Missed: 0 │
│ Recall: 100% │
│ │
│ IMPROVEMENT: 3x better recall │
│ │
└─────────────────────────────────────────────────────────────────┘

Clean input text before extraction to improve entity quality:

// Before extraction
let text = text
.replace("Dr. ", "") // Remove titles
.replace("Prof. ", "") // Remove titles
.replace("Mr. ", "") // Remove honorifics
.replace("Mrs. ", "");

Use consistent entity types across documents:

GoodBad
PERSONPerson, person, HUMAN
ORGANIZATIONOrg, Company, COMPANY
LOCATIONPlace, Location, GEO

For known aliases, consider pre-normalization mapping:

fn apply_aliases(name: &str) -> String {
match name.to_uppercase().as_str() {
"USA" | "US" | "AMERICA" => "UNITED_STATES",
"NYC" | "NEW YORK CITY" => "NEW_YORK",
_ => normalize_entity_name(name)
}
}

Track these metrics in production:

  • Dedup rate: Should be 25-50% for typical corpora
  • False merges: Manually review sample for incorrect merges
  • Description quality: Check that merged descriptions are coherent