📊 Python Document Intelligence Framework CPU Benchmarks

Comprehensive Performance Analysis of Document Intelligence Frameworks

🎯 Executive Summary

Latest Benchmark Run: all 18 supported file formats tested for a comprehensive assessment of each framework
⚠️ Methodology Note: All frameworks are multi-format document intelligence libraries tested across all supported file types for fair comparison.

Framework Performance Rankings

Framework Speed by Category (files/sec)

| Framework | Tiny | Small | Medium | Large | Huge | Success Rate | Failures | Memory (MB) | Install Size |
|---|---|---|---|---|---|---|---|---|---|
| Kreuzberg Sync | 31.78 | 8.91 | 2.42 | — | — | 100.0% | None | 359.8 | 71MB |
| Kreuzberg Async | 23.94 | 9.31 | 3.16 | — | — | 100.0% | None | 395.2 | 71MB |
| Unstructured | 4.82 | 0.86 | 0.06 | — | — | 98.8% | 3 timeouts | 1345.8 | 146MB |
| Extractous | 3.10 | 4.17 | 0.07 | — | — | 98.7% | 3 errors | 498.6 | ~100MB |
| Docling | 0.26 | 0.07 | — | — | — | 98.5% | 3 errors | 1757.8 | 1GB+ |
| MarkItDown | 26.27 | 2.61 | — | — | — | 98.2% | 3 errors | 359.8 | 251MB |

Success rates calculated on files actually tested by each framework. "—" indicates categories not included in this benchmark run. License details available in the Framework Details section below.

Memory Usage by Category

Framework Memory Usage by Category (MB)

| Framework | Tiny | Small | Medium | Large | Huge | Avg Memory (MB) |
|---|---|---|---|---|---|---|
| Kreuzberg Sync | 348 | 352 | 379 | — | — | 360 |
| Kreuzberg Async | 324 | 355 | 507 | — | — | 395 |
| Unstructured | 952 | 1832 | 1253 | — | — | 1346 |
| Extractous | 580 | 446 | 469 | — | — | 499 |
| Docling | 1794 | 1721 | — | — | — | 1758 |
| MarkItDown | 343 | 377 | — | — | — | 360 |

Memory usage shown as peak RSS (Resident Set Size) in MB during extraction

📊 Performance Analysis

📊 How to Read Performance Charts

1️⃣ Extraction Speed Rankings

🏆 Speed Champions (files/sec):
Multi-format frameworks showing consistent performance across all supported file types. Rankings based on current benchmark data.
📊 Performance comparison chart not available for this run; the underlying data is in the reports linked below.

Speed Analysis: Kreuzberg leads every tested size category (up to ~32 files/sec on tiny files), while Docling trails below 0.3 files/sec and the slower frameworks hit timeouts and errors on complex documents

2️⃣ Data Throughput Analysis

📊 Throughput Performance (MB/sec):
Measures actual data processing speed accounting for file sizes. Higher values indicate better scaling with document complexity.
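The MB/sec figure can be reproduced with a simple timing wrapper. This is an illustrative sketch, not the benchmark's actual harness; `extract` stands in for whichever framework call is under test:

```python
import time
from pathlib import Path


def throughput_mb_per_sec(path, extract):
    """Relate wall-clock extraction time to the file's size on disk."""
    size_mb = Path(path).stat().st_size / (1024 * 1024)
    start = time.perf_counter()
    extract(path)  # framework extraction call under test
    elapsed = time.perf_counter() - start
    return size_mb / elapsed
```

Note that this metric rewards frameworks that scale well with document size: a framework that is fast on tiny files can still post a low MB/sec on large, complex inputs.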
📊 Throughput comparison chart not available for this run; the underlying data is in the reports linked below.

Throughput Insights: Multi-format frameworks show consistent performance across diverse document types

3️⃣ Success Rate Reliability

✅ Reliability Rankings (% successful):
Framework reliability varies by document type and format support. See charts for detailed comparisons.
*Success rates calculated on supported formats only
📊 Success rate comparison chart not available for this run; the underlying data is in the reports linked below.

Reliability Notes: Success rates calculated only on files each framework attempts to process

📊 View Detailed Performance Report

💾 Resource Usage Analysis

📊 Memory Profiling: Peak memory usage tracked for every extraction with 50ms sampling intervals using psutil RSS measurements. Data available per file type, framework, and document size category.
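The 50ms sampling pattern described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual profiler: the memory read is abstracted into a `read_rss` callable, which in the real harness would be something like psutil's `Process().memory_info().rss`:

```python
import threading
import time


def run_with_peak(read_rss, fn, interval=0.05):
    """Run fn() while a daemon thread polls read_rss() every `interval`
    seconds (0.05 matches the report's 50ms sampling rate).
    Returns (fn's result, peak reading observed)."""
    peak = [read_rss()]
    stop = threading.Event()

    def poll():
        while not stop.is_set():
            peak[0] = max(peak[0], read_rss())
            stop.wait(interval)

    sampler = threading.Thread(target=poll, daemon=True)
    sampler.start()
    try:
        result = fn()
    finally:
        stop.set()
        sampler.join()
    peak[0] = max(peak[0], read_rss())  # catch a final high-water mark
    return result, peak[0]
```

One caveat of sampling-based profiling: short allocation spikes between two 50ms samples are invisible, so reported peaks are a lower bound on true peak RSS.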

📊 How to Read Memory & Resource Charts

1️⃣ Memory Usage Rankings by Framework

🏆 Memory Efficiency Ranking (Lower MB = Better):
Memory usage varies significantly by framework and document type. See detailed analysis below.
📊 Memory usage chart not available for this run; the underlying data is in the reports linked below.

Interpretation: Shows average peak memory consumption across all file types. Lower bars indicate more memory-efficient frameworks.

2️⃣ Detailed Memory Usage by File Type

📊 Format-Specific Memory Patterns:
• PDFs: Show highest memory variance (50MB - 2GB+)
• Images: Consistent high memory usage across frameworks
• Office Docs: Moderate memory requirements (200-800MB)
• Text/Markup: Lowest memory footprint (<100MB)
Memory Usage by File Type

📊 Detailed memory profiling per file type is available in the interactive dashboard and detailed reports: View Interactive Memory Analysis

Framework Behavior: Each framework shows distinct memory patterns per file type. Frameworks optimized for specific formats use significantly less memory on their target documents.

3️⃣ Performance by Document Size Categories

📏 Size Category Performance (Speed Ranking):
• Tiny (<100KB): fast extraction
• Small (100KB-1MB): consistent performance
• Medium (1-10MB): mixed results
• Large (10-50MB): framework timeouts common
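The size boundaries above can be expressed as a small bucketing helper. This is a sketch: the thresholds come from the text, but the upper cutoff for "huge" (beyond 50MB) is an assumption, not stated in the report:

```python
def size_category(num_bytes: int) -> str:
    """Bucket a file into the benchmark's size categories.
    Boundaries follow the text; "huge" (>50MB) is an assumed cutoff."""
    kb, mb = 1024, 1024 * 1024
    if num_bytes < 100 * kb:
        return "tiny"
    if num_bytes < 1 * mb:
        return "small"
    if num_bytes < 10 * mb:
        return "medium"
    if num_bytes < 50 * mb:
        return "large"
    return "huge"
```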
📊 Category analysis chart not available for this run; the underlying data is in the reports linked below.

Size Scaling: Performance patterns change dramatically with document size. Memory usage can increase exponentially for complex documents regardless of file size.

4️⃣ Installation Size Comparison

💿 Installation Footprint Ranking (Smaller = Better):
Framework installation sizes range from 71MB to over 1GB depending on dependencies:

Lightweight: Kreuzberg (71MB)

Moderate: Extractous (100MB), Unstructured (146MB)

Heavy: MarkItDown (251MB), Docling (1GB+)

Trade-offs: Larger installations often include ML models and extensive format support, while smaller frameworks focus on specific use cases.

📈 Key Memory Usage Insights

📋 View Detailed Memory Report

📄 Format Support Analysis

📊 How to Read Format Support Charts

1️⃣ Format Categories Overview


📄 Format Categories Tested

  • Documents: PDF, DOCX, PPTX, XLSX, XLS, ODT
  • Web/Markup: HTML, MD, RST, ORG
  • Images: PNG, JPG, JPEG, BMP
  • Email: EML, MSG
  • Data: CSV, JSON, YAML
  • Text: TXT

Total: 18 different file formats across 6 categories

Format Diversity: Comprehensive testing across document types commonly encountered in real-world document intelligence scenarios.

📋 Metadata Extraction Analysis

📊 Metadata Diversity: Comprehensive analysis of metadata extraction capabilities across frameworks, covering author information, creation dates, language detection, page counts, and 20+ metadata fields per document type.

📊 How to Read Metadata Analysis

1️⃣ Metadata Coverage by Framework

📊 Metadata Extraction Leaders:
Frameworks vary significantly in metadata extraction capabilities. Multi-format tools provide comprehensive coverage across diverse document types.
Metadata Coverage

📊 Comprehensive metadata extraction analysis covering 20+ fields per document type is available: View Analysis

Coverage Analysis: Shows percentage of metadata fields successfully extracted by each framework across all document types.
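That coverage percentage reduces to a simple computation. This sketch is illustrative (the field names and the non-empty check are assumptions, not the benchmark's actual scoring code):

```python
def metadata_coverage(extracted: dict, expected: list) -> float:
    """Percentage of expected metadata fields that came back non-empty."""
    hits = sum(
        1 for field in expected
        if extracted.get(field) not in (None, "", [], {})
    )
    return 100.0 * hits / len(expected)
```

For example, a framework that returns only a title out of four expected fields would score 25% coverage for that document.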

2️⃣ Field Extraction Comparison

📋 Metadata Field Types:
• Document Properties: Title, author, creation/modification dates
• Content Metrics: Page count, word count, character count
• Technical Data: MIME type, encoding, compression info
• Quality Indicators: Language detection, format version
Field Comparison

📋 A field-by-field breakdown showing which frameworks extract each metadata type is available: Download CSV

Field Analysis: Compares specific metadata field extraction capabilities across frameworks, highlighting strengths and gaps.

🔍 Metadata Extraction Capabilities

📈 Key Metadata Insights

📋 View Complete Metadata Analysis 📊 Download Field Comparison Data

✨ Quality Assessment Analysis

🎯 ML-Based Quality Metrics: Comprehensive document intelligence quality analysis using sentence transformers, readability metrics, coherence analysis, and document-specific quality checks across all frameworks and file types.

📊 How to Read Quality Assessment Charts

1️⃣ Quality Scores by Framework

🏆 Quality Rankings (Higher Score = Better):
Quality assessment provides ML-based scoring for extraction accuracy, coherence, and completeness across all tested frameworks and file types.
Quality Assessment

📊 Quality assessment must be enabled during benchmark execution:

uv run python -m src.cli benchmark --enable-quality-assessment

View Quality Data

Quality Metrics: Combines extraction completeness, text coherence, semantic similarity, and document-specific quality checks.

2️⃣ Readability Analysis

📖 Readability Metrics:
• Flesch Reading Ease: Higher scores = easier to read
• Gunning Fog Index: Lower scores = more accessible text
• Sentence Structure: Analysis of complexity and coherence
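The two named metrics follow standard textbook formulas. The sketch below is not the benchmark's implementation (which likely uses a readability library with dictionary-based syllable counts); the crude vowel-group syllable counter here is an assumption for illustration:

```python
import re


def _syllables(word: str) -> int:
    # Crude vowel-group count; production tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words); higher = easier."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(_syllables(w) for w in words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)


def gunning_fog(text: str) -> float:
    """0.4 * (words/sentences + 100 * complex_words/words); lower = more accessible."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    complex_words = sum(1 for w in words if _syllables(w) >= 3)
    return 0.4 * (n / sentences + 100 * complex_words / n)
```

Garbled extraction output (dropped spaces, merged lines) inflates apparent sentence length and syllable counts, which is why these scores can flag poor extractions.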
📖 Readability Analysis

Readability metrics are computed using Flesch Reading Ease and the Gunning Fog Index to assess extracted-text quality. Enable quality assessment during benchmark runs to generate the readability charts:

uv run python -m src.cli benchmark --enable-quality-assessment

Text Quality: Measures how well frameworks preserve readable, coherent text structure during extraction.

🔬 Quality Assessment Capabilities

📈 Quality Scoring Methodology

🎯 Key Quality Insights

💡 Enable Quality Assessment: Run benchmarks with --enable-quality-assessment flag to generate comprehensive quality metrics and visualizations.
📊 View Quality Enhanced Results 📖 Quality Analysis Report


🔧 Framework Details

Kreuzberg

License: MIT | Version: 3.8.1 | Size: 71MB base

Fast Python document intelligence with multiple OCR backends. Supports both sync and async APIs.

Strengths: Speed, small footprint, async support, comprehensive format coverage

Format Support: All tested formats except MSG (no open source support)

Commercial Use: ✅ Fully permissive MIT license

Docling

License: MIT | Version: 2.41.0 | Size: 1GB+

IBM Research's advanced document understanding with ML models.

Strengths: Advanced ML understanding, high quality

Format Support: PDF, DOCX, XLSX, PPTX, HTML, CSV, MD, AsciiDoc, Images

Commercial Use: ✅ Fully permissive MIT license

MarkItDown

License: MIT | Version: 0.0.1a2 | Size: 251MB

Microsoft's lightweight Markdown converter optimized for LLM processing.

Strengths: LLM-optimized output, ONNX performance

Limitations: Limited format support

Commercial Use: ✅ Fully permissive MIT license

Unstructured

License: Apache 2.0 | Version: 0.18.5 | Size: 146MB

Enterprise solution supporting 64+ file types.

Strengths: Widest format support, enterprise features

Limitations: Moderate speed

Commercial Use: ✅ Permissive Apache 2.0 license

Extractous

License: Apache 2.0 | Version: 0.1.0 | Size: ~100MB

Fast Rust-based extraction with Python bindings.

Strengths: Native performance, low memory usage

Format Support: Common office and web formats

Commercial Use: ✅ Permissive Apache 2.0 license

📋 Detailed Reports & Data

🌐 HTML Report | 📝 Markdown Report | 📊 JSON Metrics | 📊 Summary Data

🔬 Advanced Analysis

Additional analysis modules are available in the detailed reports section above.

📊 Table Extraction Analysis

Specialized analysis of table detection and extraction capabilities across frameworks, focusing on structure preservation, cell accuracy, and formatting retention.

Table Detection Performance


📊 Table detection analysis available for documents with table content

View Analysis

Structure Preservation Quality

Structure Quality

📋 Table structure analysis data available in JSON format

Browse Analysis

🔍 Table Extraction Capabilities

💡 Table Analysis: Run benchmarks with --table-extraction-only flag to focus analysis on documents containing tables.

💾 Memory Profiling Data Available

📊 Per-File-Type Performance Analysis

Detailed per-file-type performance data is available in the benchmark reports above.

🔬 Performance Methodology by File Type

📐 Performance Metrics Breakdown

🎯 Key Insights from File-Type Analysis

Links