# PDF Ingestion Tutorial

EdgeQuake extracts text, tables, and metadata from PDF documents using advanced layout analysis. This tutorial shows you how to upload PDFs and configure extraction for optimal results.
## What You'll Learn

- Upload a PDF document (5 minutes)
- Configure extraction options (10 minutes)
- Verify extraction quality (5 minutes)
- Query PDF content (5 minutes)
Prerequisites:
- EdgeQuake server running (see Quick Start)
- A PDF file to upload
- `curl` or `httpie` installed
Time Estimate: 25 minutes
## When to Read This

Read this tutorial if:
- First time uploading PDFs
- Need quick reference for configuration options
- Want to verify extraction quality
Read PDF Processing Deep Dive if:
- Understanding extraction internals
- Advanced table detection algorithms
- Contributing to PDF crate
Read Troubleshooting Guide if:
- Extraction fails or produces poor quality
- Tables not detected correctly
- Need detailed error solutions
Theory vs Practice:
- This tutorial: “How do I upload and configure?”
- Deep dive: “How does table detection work internally?”
- Both are valuable - start here, dig deeper as needed.
## Quick Start: Your First PDF Upload

### Step 1: Upload the PDF

```bash
# Upload with default settings (text mode)
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/paper.pdf" \
  -F "title=Research Paper" \
  http://localhost:8080/api/v1/documents
```

What Happens:
Upload → Parse PDF → Extract text → Detect tables → Build chunks → Index → Ready

Response:
```json
{
  "id": "doc-uuid-1234",
  "title": "Research Paper",
  "status": "completed",
  "content_hash": "sha256:abc123...",
  "chunk_count": 45,
  "entity_count": 23,
  "relationship_count": 18,
  "created_at": "2024-01-15T10:30:00Z",
  "processing_time_ms": 2340
}
```

Key Fields:
- `id`: Use this to reference the document in queries
- `status`: `completed` means extraction succeeded
- `chunk_count`: Number of text chunks created (paragraphs, tables)
- `processing_time_ms`: Extraction took ~2.3 seconds
Note: Base URL is http://localhost:8080 by default. If your server uses a different port, adjust accordingly.
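In scripts, you will usually want to capture the returned `id` for follow-up calls. A minimal sketch using only `grep` and `sed`; the `response` variable stands in for the live `curl` output above so the snippet runs on its own (with `jq` installed, `jq -r .id` is simpler):

```shell
# Stand-in for: response=$(curl -s -X POST ... http://localhost:8080/api/v1/documents)
response='{"id": "doc-uuid-1234", "title": "Research Paper", "status": "completed", "chunk_count": 45}'

# Pull the "id" field out of the JSON response
doc_id=$(printf '%s' "$response" | grep -o '"id": *"[^"]*"' | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/')
echo "$doc_id"
```

You can then reuse `$doc_id` in the status and query calls below.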
### Step 2: Verify Upload Succeeded

```bash
# Check document status
curl http://localhost:8080/api/v1/documents/doc-uuid-1234
```

Response:
```json
{
  "id": "doc-uuid-1234",
  "title": "Research Paper",
  "status": "indexed",
  "metadata": {
    "pages": 12,
    "tables_detected": 3,
    "figures": 5
  }
}
```

Look for:
- ✅ `status: "indexed"` - ready to query
- ✅ `chunk_count > 0` - text extracted successfully
- ✅ `entity_count > 0` - knowledge graph built
- ⚠️ `status: "failed"` - see troubleshooting
Tip: For complex PDFs with tables, consider enabling table enhancement (see Configuration).
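This check can be scripted for pipelines that wait for indexing to finish. A minimal polling sketch; `fetch_doc` is a stub standing in for the real `curl` call so the snippet runs on its own:

```shell
# Poll until the document is indexed. fetch_doc stands in for:
#   curl -s http://localhost:8080/api/v1/documents/doc-uuid-1234
fetch_doc() { echo '{"id": "doc-uuid-1234", "status": "indexed", "metadata": {"pages": 12}}'; }

status=""
for attempt in 1 2 3 4 5; do
  status=$(fetch_doc | grep -o '"status": *"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/')
  [ "$status" = "indexed" ] && break
  [ "$status" = "failed" ] && { echo "extraction failed - see troubleshooting"; break; }
  sleep 1   # give the server time before the next poll
done
echo "status: $status"
```

Replace the stub with the real `curl` call (and a larger attempt count for big documents) in practice.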
### Step 3: Query the PDF Content

```bash
# Ask a question about the document
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings?",
    "mode": "hybrid"
  }' \
  http://localhost:8080/api/v1/query
```

Response:
```json
{
  "answer": "The key findings show that...",
  "sources": [
    {
      "document_id": "doc-uuid-1234",
      "chunk_id": "chunk-5",
      "relevance": 0.94,
      "page": 3,
      "content": "The results demonstrate..."
    }
  ],
  "response_time_ms": 1200
}
```

Success: You've uploaded, indexed, and queried a PDF in under 5 minutes! 🎉
## Upload Flow

The upload sequence (User → EdgeQuake Server → Knowledge Graph):

1. Client POSTs the file + metadata to `/api/v1/documents`
2. Parse PDF (extract pages)
3. Extract text (with layout)
4. Detect tables (spatial clustering)
5. Build chunks (semantic units)
6. Extract entities (people, orgs, concepts) into the knowledge graph
7. Index for search, then respond with `{id, status, chunks}`
8. On a query request, search the graph for relevant chunks and respond with `{answer, sources}`

Total time: 2-5 seconds (text mode) | 20-50 seconds (vision mode)

## Configuration Options

EdgeQuake supports three extraction modes: Text, Vision, and Hybrid. Choose the mode based on your PDF quality and requirements.
### When to Use What

Text Mode (default, fastest):
- ✅ Good quality digital PDFs
- ✅ Standard fonts and encoding
- ✅ Simple to moderately complex layouts
- Processing Time: 2-5 seconds per document
- Cost: Free (no LLM API calls)
Vision Mode (slowest, most accurate):
- ⚠️ Scanned documents (images)
- ⚠️ Poor quality PDFs
- ⚠️ No text layer (image-only PDFs)
- ⚠️ Complex layouts or handwriting
- Processing Time: 20-50 seconds per document
- Cost: ~$0.001-0.01 per page (LLM vision API)
Hybrid Mode (automatic fallback):
- ⚠️ Mixed quality (some pages good, some poor)
- ⚠️ Unsure about PDF quality
- Processing Time: Variable (2-50 seconds)
- Cost: Only vision pages incur LLM cost
Table Enhancement:
- ⚠️ Complex table layouts
- ⚠️ Merged cells
- ⚠️ Nested tables
- Trade-off: 2x slower, better table accuracy
- Cost: ~$0.0001 per table (LLM refinement)
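The decision guide above can be condensed into a small helper. This is a sketch only — the inputs are facts you know about the document, and the mapping simply restates the bullets above:

```shell
# Map what you know about a PDF to an extraction mode (restates the guide above)
pick_mode() {
  scanned="$1"   # yes/no: image-only or scanned pages?
  mixed="$2"     # yes/no: quality varies page to page, or unsure?
  if [ "$scanned" = "yes" ]; then
    echo "Vision"
  elif [ "$mixed" = "yes" ]; then
    echo "Hybrid"
  else
    echo "Text"
  fi
}

pick_mode no no    # good digital PDF
pick_mode yes no   # scanned book
```

Add `"enhance_tables": true` to the chosen config independently when tables are complex.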
### Example 1: Text Mode (Default)

```bash
# Default text mode - fastest, free
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@report.pdf" \
  -F "title=Annual Report" \
  http://localhost:8080/api/v1/documents
```

Use for: 80% of digital PDFs
Processing: 2-5 seconds
Cost: Free
### Example 2: Vision Mode (Scanned PDFs)

For scanned documents or image-based PDFs, explicitly set mode to Vision:

```bash
# Vision mode for scanned documents
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@scanned_book.pdf" \
  -F "title=Scanned Book" \
  -F 'config={"mode": "Vision", "vision_dpi": 150}' \
  http://localhost:8080/api/v1/documents
```

Configuration Fields:
- `mode`: `"Text"`, `"Vision"`, or `"Hybrid"`
- `vision_dpi`: DPI for rendering (150 = good quality, 200 = higher accuracy but slower)
Use for: Scanned books, poor quality PDFs, image-only PDFs
Processing: 20-50 seconds depending on page count
Cost: ~$0.001-0.01 per page (OpenAI GPT-4o-mini)
Cost Example: 50-page book at $0.005/page = $0.25 total
### Example 3: Hybrid Mode (Automatic)

Hybrid mode uses text extraction first, then falls back to vision for low-quality pages:

```bash
# Hybrid mode - automatic quality detection
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@mixed_quality.pdf" \
  -F "title=Mixed Quality Document" \
  -F 'config={"mode": "Hybrid", "quality_threshold": 0.7}' \
  http://localhost:8080/api/v1/documents
```

Configuration Fields:
- `quality_threshold`: If text extraction confidence < this value, use vision (0.0-1.0)
- Default: `0.5` (switch to vision for confidence < 50%)
Use for: Unknown PDF quality, mixed content documents
Processing: 2-50 seconds depending on quality
Cost: Only low-quality pages incur vision costs
### Example 4: Enhanced Table Detection

For PDFs with complex tables (merged cells, nested structures):

```bash
# Enable LLM-based table enhancement
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@financial_report.pdf" \
  -F "title=Financial Report" \
  -F 'config={"enhance_tables": true, "mode": "Text"}' \
  http://localhost:8080/api/v1/documents
```

Configuration Fields:
- `enhance_tables`: Enable LLM refinement for tables (default: `false`)
- `ai_temperature`: LLM temperature for table enhancement (0.0-1.0, default: 0.1)
Use for: Financial reports, spreadsheets, data-heavy documents
Processing: 2x slower than default
Cost: ~$0.0001 per table
Result: Tables with merged cells and complex layouts correctly preserved in markdown.
### Example 5: Multi-Column Layout

For academic papers and newspaper-style layouts:

```bash
# Enable column detection
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@research_paper.pdf" \
  -F "title=Research Paper" \
  -F 'config={"layout": {"detect_columns": true, "column_gap_threshold": 20.0}}' \
  http://localhost:8080/api/v1/documents
```

Configuration Fields:
- `layout.detect_columns`: Enable multi-column detection (default: `true`)
- `layout.column_gap_threshold`: Minimum gap in points for column separation (default: 20.0)
Use for: Academic papers, newspapers, magazines
Processing: Minimal overhead
Cost: Free
### Example 6: Full Enhancement (Complex Documents)

For critical documents where accuracy matters more than speed:

```bash
# Enable all enhancements
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "file=@complex_report.pdf" \
  -F "title=Complex Report" \
  -F 'config={
    "mode": "Vision",
    "enhance_tables": true,
    "layout": {"detect_columns": true},
    "enhance_readability": true,
    "vision_dpi": 200
  }' \
  http://localhost:8080/api/v1/documents
```

Use for: Legal documents, critical reports, archival
Processing: 10x slower
Cost: ~$0.01 per page
Trade-off: Maximum accuracy, but significantly slower and more expensive.
## Configuration Reference

### PdfConfig Fields

Complete reference of available configuration options:

```jsonc
{
  "mode": "Text",                    // Text | Vision | Hybrid
  "output_format": "Markdown",       // Markdown | Json | Html | Chunks
  "ocr_threshold": 0.8,              // OCR confidence threshold (0.0-1.0)
  "max_pages": null,                 // Limit pages to process (null = all)
  "include_page_numbers": true,      // Include page numbers in output
  "extract_images": true,            // Extract embedded images
  "enhance_tables": false,           // LLM table refinement
  "ai_temperature": 0.1,             // LLM temperature (0.0 = deterministic)
  "normalize_spacing": true,         // Fix concatenated words
  "consolidate_headers": true,       // Merge broken headers
  "extract_figure_captions": true,   // Extract figure captions
  "enhance_readability": false,      // AI full-page enhancement
  "vision_dpi": 150,                 // DPI for vision mode (150-300)
  "quality_threshold": 0.5,          // Hybrid mode threshold
  "layout": {
    "detect_columns": true,          // Multi-column detection
    "detect_tables": true,           // Table detection
    "detect_equations": true,        // Equation detection
    "column_gap_threshold": 20.0,    // Column gap in points
    "use_xy_cut": true               // XY-Cut algorithm for layout
  }
}
```

Defaults: Most fields have sensible defaults. Override only when needed.
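For repeated uploads it can be handy to keep the config in a file rather than inline on the command line. A sketch using curl's `config=<file` form, which reads a form field's value from a file (the field names come from the reference above):

```shell
# Keep a reusable config file instead of inlining JSON on the command line
cat > pdf_config.json <<'EOF'
{
  "mode": "Hybrid",
  "quality_threshold": 0.7,
  "enhance_tables": true,
  "layout": { "detect_columns": true }
}
EOF

# curl's "config=<file" form reads the field value from the file:
#   curl -F "file=@doc.pdf" -F "config=<pdf_config.json" http://localhost:8080/api/v1/documents
cat pdf_config.json
```

Note the `<` (read value from file) rather than `@` (attach as file upload) in the `config` field.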
## Verifying Extraction Quality

### Understanding Chunk Counts

After upload, check `chunk_count` to verify extraction succeeded:

```jsonc
{
  "chunk_count": 45,        // Number of semantic chunks created
  "entity_count": 23,       // Number of entities extracted
  "relationship_count": 18
}
```

What Affects Chunk Count:
- PDF length (more pages → more chunks)
- Layout complexity (tables, figures → separate chunks)
- Text density (dense text → more chunks)
Typical Ranges:
- 10-page report: 20-40 chunks
- 50-page book: 100-200 chunks
- 100-page thesis: 300-500 chunks
If chunk_count = 0: Extraction failed. See troubleshooting.
### Checking Extraction Details

Get detailed metadata about the document:

```bash
# Get document details
curl http://localhost:8080/api/v1/documents/doc-uuid-1234
```

Response:
```json
{
  "id": "doc-uuid-1234",
  "title": "Research Paper",
  "status": "indexed",
  "metadata": {
    "pages": 12,
    "tables_detected": 3,
    "figures": 5,
    "extraction_mode": "Text",
    "processing_time_ms": 2340
  },
  "chunks": [
    {
      "id": "chunk-1",
      "content": "Abstract: This paper presents...",
      "page": 1,
      "type": "text"
    },
    {
      "id": "chunk-2",
      "content": "| Column 1 | Column 2 |\n|----------|----------|\n| A | B |",
      "page": 3,
      "type": "table"
    }
  ]
}
```

Key Metadata:
- `tables_detected`: Number of tables found
- `figures`: Number of figures/images
- `extraction_mode`: Mode used (Text, Vision, Hybrid)
- Chunks array shows actual extracted content
### When to Retry with Different Config

If `chunk_count` < expected:
- Check if PDF is scanned → Try Vision mode
- Check if tables malformed → Enable `enhance_tables`
- Check if text order wrong → Enable `detect_columns`
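This retry checklist can be sketched as a helper that suggests the next config from the `chunk_count` you got back. The halfway threshold is purely illustrative, not an EdgeQuake rule:

```shell
# Suggest the next config to try, based on the chunk_count you got back.
# The halfway threshold below is an arbitrary illustration, not an EdgeQuake rule.
next_config() {
  got="$1"; expected="$2"
  if [ "$got" -eq 0 ]; then
    echo '{"mode": "Vision"}'                              # likely a scanned PDF
  elif [ "$got" -lt $((expected / 2)) ]; then
    echo '{"mode": "Hybrid", "quality_threshold": 0.7}'    # mixed quality
  else
    echo 'keep current config'
  fi
}

next_config 0 50   # nothing extracted
next_config 5 50   # well short of expected
```

Malformed tables or scrambled text order are judged by inspecting chunks, so `enhance_tables` and `detect_columns` stay manual decisions.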
Example Iteration:

```bash
# First try: Default text mode
curl -F "file=@doc.pdf" http://localhost:8080/api/v1/documents
# Result: chunk_count = 5 (expected 50+) ❌

# Second try: Enable vision mode
curl -F "file=@doc.pdf" \
  -F 'config={"mode": "Vision"}' \
  http://localhost:8080/api/v1/documents
# Result: chunk_count = 52 ✅
```

## Common Patterns
### Pattern 1: Multi-Page Reports

Scenario: 50-page annual report with text + tables
Approach:
```bash
# Start with default
curl -X POST \
  -F "file=@annual_report.pdf" \
  -F "title=Annual Report 2024" \
  http://localhost:8080/api/v1/documents
```

Check results:
- If `tables_detected > 0` and chunks look good → ✅ Done
- If tables malformed → Re-upload with `enhance_tables: true`
Large Document Tip: Use `max_pages` to test on the first 10 pages:

```bash
curl -F "file=@report.pdf" \
  -F 'config={"max_pages": 10}' \
  http://localhost:8080/api/v1/documents
```

### Pattern 2: Academic Papers (Multi-Column)

Scenario: Research paper with two-column layout, figures, equations
Approach:
```bash
# Enable column detection
curl -X POST \
  -F "file=@research_paper.pdf" \
  -F "title=AI Research Paper" \
  -F 'config={
    "layout": {"detect_columns": true},
    "extract_figure_captions": true
  }' \
  http://localhost:8080/api/v1/documents
```

Tips:
- Column detection ensures text reads left-to-right within columns
- Figure captions extracted separately for better context
- Equations may not extract perfectly (vision mode helps)
### Pattern 3: Scanned Books

Scenario: 200-page scanned book, faded text, skewed pages
Approach:
```bash
# Vision mode for scanned documents
curl -X POST \
  -F "file=@scanned_book.pdf" \
  -F "title=Historical Book" \
  -F 'config={
    "mode": "Vision",
    "vision_dpi": 150,
    "enhance_readability": true
  }' \
  http://localhost:8080/api/v1/documents
```

Cost Estimate: 200 pages × $0.005/page = $1.00 total
Processing Time: ~200 pages × 10 seconds/page = 33 minutes
Tip: For long books, use `max_pages` to test the first 10 pages, then upload the full book.
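Before committing a long book to vision mode, estimate the cost and wall-clock time from the per-page figures quoted above. A quick sketch (both figures are rough estimates, not guarantees):

```shell
pages=200
cost_per_page=0.005    # rough vision-mode cost quoted above; varies by provider
secs_per_page=10       # rough per-page processing time

awk -v p="$pages" -v c="$cost_per_page" -v s="$secs_per_page" 'BEGIN {
  printf "estimated cost: $%.2f\n", p * c
  printf "estimated time: %.0f minutes\n", p * s / 60
}'
```

Swap in your own page count and your provider's per-page price before deciding.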
### Pattern 4: Financial Reports (Complex Tables)

Scenario: Quarterly report with merged cells, nested tables, footnotes
Approach:
```bash
# Enable table enhancement
curl -X POST \
  -F "file=@financial_report.pdf" \
  -F "title=Q4 2024 Financials" \
  -F 'config={
    "enhance_tables": true,
    "ai_temperature": 0.1
  }' \
  http://localhost:8080/api/v1/documents
```

Expected Results:
- Tables preserved in markdown format
- Merged cells handled correctly
- Footnotes linked to table cells
### Pattern 5: Non-English Documents

Scenario: PDF in Spanish, Chinese, or Arabic
Approach:
```bash
# Vision mode with LLM handles non-English better
curl -X POST \
  -F "file=@spanish_doc.pdf" \
  -F "title=Documento en Español" \
  -F 'config={"mode": "Vision"}' \
  http://localhost:8080/api/v1/documents
```

LLM Language Support:
- OpenAI GPT-4o: 100+ languages
- Ollama: Depends on model (check model docs)
Tip: Vision mode typically handles non-English better than text mode due to font encoding issues.
## Troubleshooting Quick Reference

See full guide: Common Issues - PDF Section

### Quick Fixes Table

| Issue | Solution | Config |
|---|---|---|
| No text extracted | Enable vision mode | `{"mode": "Vision"}` |
| Tables broken | Enable table enhancement | `{"enhance_tables": true}` |
| Wrong text order | Enable multi-column | `{"layout": {"detect_columns": true}}` |
| chunk_count = 0 | Try vision mode | `{"mode": "Vision"}` |
| Upload fails | Check file size/format | PDF only, < 50MB |
| Encoding errors (�) | Use vision mode | `{"mode": "Vision"}` |
### Most Common Issues

1. No text extracted (`chunk_count` = 0):
   - Cause: PDF is image-based (scanned)
   - Solution: `{"mode": "Vision"}`

2. Tables not detected:
   - Cause: Complex table layout
   - Solution: `{"enhance_tables": true}`

3. Text order scrambled:
   - Cause: Multi-column layout
   - Solution: `{"layout": {"detect_columns": true}}`
### When to Seek Help

- `chunk_count` still 0 after vision mode
- Specific table layout not detected
- Custom fonts not supported
- Upload fails repeatedly
Next Steps:
- Read PDF Processing Deep Dive for internals
- Check Troubleshooting Guide for detailed solutions
- File GitHub issue with PDF sample
## Next Steps

### Beginner Path

- ✅ Uploaded first PDF (this tutorial)
- ➡️ Read Document Ingestion for chunking details
- ➡️ Read Query Optimization for RAG techniques
### Advanced Path

- ✅ Mastered PDF configuration (this tutorial)
- ➡️ Read PDF Processing Deep Dive for algorithms
- ➡️ Read the GitHub Contributing Guide to improve the PDF crate
### Troubleshooting Path

- ⚠️ Encountered PDF extraction issues
- ➡️ Read Common Issues
- ➡️ File GitHub issue if problem persists
## Related Documentation

- PDF Processing Deep Dive - Algorithms and internals
- Document Ingestion - Chunking and entity extraction
- REST API Reference - Complete API docs
- Troubleshooting - Error solutions
- Quick Start - Server setup