Comprehensive Performance Analysis of Document Intelligence Frameworks
Framework | Tiny (files/sec) | Small (files/sec) | Medium (files/sec) | Large (files/sec) | Huge (files/sec) | Success Rate | Failures | Memory (MB) | Install Size |
---|---|---|---|---|---|---|---|---|---|
Kreuzberg Sync | 31.78 | 8.91 | 2.42 | — | — | 100.0% | None | 359.8 | 71MB |
Kreuzberg Async | 23.94 | 9.31 | 3.16 | — | — | 100.0% | None | 395.2 | 71MB |
Unstructured | 4.82 | 0.86 | 0.06 | — | — | 98.8% | 3 timeouts | 1345.8 | 146MB |
Extractous | 3.10 | 4.17 | 0.07 | — | — | 98.7% | 3 errors | 498.6 | ~100MB |
Docling | 0.26 | 0.07 | — | — | — | 98.5% | 3 errors | 1757.8 | 1GB+ |
Markitdown | 26.27 | 2.61 | — | — | — | 98.2% | 3 errors | 359.8 | 251MB |
Success rates calculated on files actually tested by each framework. "—" indicates categories not included in this benchmark run. License details available in the Framework Details section below.
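The success-rate rule stated above (only files a framework actually attempted count toward the denominator) can be sketched as follows. The `Result` record and its status strings are hypothetical names chosen for illustration; the benchmark's real data model is not shown in this report.

```python
from dataclasses import dataclass


@dataclass
class Result:
    path: str
    status: str  # assumed values: "ok", "error", "timeout", "skipped"


def success_rate(results):
    """Percentage of attempted files that were extracted successfully.

    Files marked "skipped" (format not supported or not tested) are
    excluded from the denominator. Returns None if nothing was attempted.
    """
    attempted = [r for r in results if r.status != "skipped"]
    if not attempted:
        return None
    succeeded = sum(1 for r in attempted if r.status == "ok")
    return 100.0 * succeeded / len(attempted)


# Illustrative data only, not taken from the benchmark run.
sample = [
    Result("a.pdf", "ok"),
    Result("b.docx", "ok"),
    Result("c.pdf", "timeout"),
    Result("d.msg", "skipped"),
]
rate = success_rate(sample)
print(f"{rate:.1f}% of attempted files succeeded")
```

Under this rule a framework that skips unsupported formats is not penalized for them, so success rates are comparable only alongside each framework's format coverage.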
Framework | Tiny (MB) | Small (MB) | Medium (MB) | Large (MB) | Huge (MB) | Avg Memory (MB) |
---|---|---|---|---|---|---|
Kreuzberg Sync | 348 | 352 | 379 | - | - | 360 |
Kreuzberg Async | 324 | 355 | 507 | - | - | 395 |
Unstructured | 952 | 1832 | 1253 | - | - | 1346 |
Extractous | 580 | 446 | 469 | - | - | 499 |
Docling | 1794 | 1721 | - | - | - | 1758 |
Markitdown | 343 | 377 | - | - | - | 360 |
Memory usage is shown as peak RSS (resident set size) in MB during extraction.
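For readers who want to reproduce a peak-RSS figure, here is a minimal sketch using Python's standard `resource` module (Unix only). It is not necessarily how this benchmark collects its numbers. The `run_and_report` helper and the stand-in workload are hypothetical.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MB."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss / divisor


def run_and_report(workload):
    """Run a callable, then return the process-wide peak RSS in MB."""
    workload()
    return peak_rss_mb()


# Stand-in workload; a real measurement would call an extraction function.
peak = run_and_report(lambda: sum(len(str(i)) for i in range(100_000)))
print(f"peak RSS: {peak:.1f} MB")
```

Because `ru_maxrss` is a high-water mark for the whole process lifetime, isolating per-file or per-framework peaks generally requires running each measurement in a fresh subprocess.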
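Speed Analysis: Kreuzberg is the fastest framework in every tested size category, peaking at 31.78 files/sec on tiny files (sync API), while Docling is the slowest (0.26 files/sec on tiny files) and shows timeout issues on complex documents.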
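A files/sec figure can be obtained by timing a batch and dividing the file count by the elapsed wall-clock time. The sketch below assumes that definition; the report does not state the exact method, and `fake_extract` is a hypothetical placeholder for a real extraction call.

```python
import time


def files_per_second(extract, paths):
    """Process every path with `extract` and return throughput in files/sec."""
    start = time.perf_counter()
    for path in paths:
        extract(path)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast, tiny batches.
    return len(paths) / elapsed if elapsed > 0 else float("inf")


def fake_extract(path):
    # Hypothetical stand-in: simulate roughly 1 ms of work per file.
    time.sleep(0.001)


throughput = files_per_second(fake_extract, ["a.pdf", "b.pdf", "c.pdf"])
print(f"{throughput:.2f} files/sec")
```

Timing the whole batch rather than each file keeps timer overhead out of the result, but it also folds any one-time setup cost into the first file.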
Throughput Insights: Multi-format frameworks show consistent performance across diverse document types
Reliability Notes: Success rates calculated only on files each framework attempts to process
Interpretation: Shows average peak memory consumption across all file types. Lower bars indicate more memory-efficient frameworks.
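As a consistency check, the reported averages can be compared with the unweighted mean of the per-category peaks in the memory table. The sketch below assumes the reported average was computed over individual files, which would explain why the two differ slightly; the report does not confirm this.

```python
from statistics import mean

# Per-category peak memory (MB) for the categories that were run.
per_category = {
    "Kreuzberg Sync": [348, 352, 379],
    "Kreuzberg Async": [324, 355, 507],
    "Unstructured": [952, 1832, 1253],
    "Extractous": [580, 446, 469],
    "Docling": [1794, 1721],
    "Markitdown": [343, 377],
}

# Average memory (MB) as reported in the summary table.
reported = {
    "Kreuzberg Sync": 359.8,
    "Kreuzberg Async": 395.2,
    "Unstructured": 1345.8,
    "Extractous": 498.6,
    "Docling": 1757.8,
    "Markitdown": 359.8,
}

# Difference between the category mean and the reported average.
deltas = {
    name: mean(values) - reported[name]
    for name, values in per_category.items()
}
for name, delta in deltas.items():
    print(f"{name}: category mean differs from reported average by {delta:+.2f} MB")
```

All six category means land within half a megabyte of the reported averages, so the two tables agree with each other.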
Framework Behavior: Each framework shows distinct memory patterns per file type. Frameworks optimized for specific formats use significantly less memory on their target documents.
Size Scaling: Performance changes markedly with document size. Memory usage can rise sharply for complex documents even when the file itself is small.
Trade-offs: Larger installations often include ML models and extensive format support, while smaller frameworks focus on specific use cases.
Total: 18 different file formats across 6 categories
Format Diversity: Comprehensive testing across document types commonly encountered in real-world document intelligence scenarios.
Coverage Analysis: Shows percentage of metadata fields successfully extracted by each framework across all document types.
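One simple way to compute such a coverage percentage is to count how many expected fields come back non-empty. The field names below are hypothetical examples; the report does not list the fields the benchmark actually checks.

```python
def metadata_coverage(extracted, expected_fields):
    """Percentage of expected metadata fields that have a non-empty value."""
    if not expected_fields:
        return 0.0
    present = [
        field
        for field in expected_fields
        # Treat missing keys, None, empty strings and empty lists as absent.
        if extracted.get(field) not in (None, "", [])
    ]
    return 100.0 * len(present) / len(expected_fields)


# Hypothetical field list and extraction result, for illustration only.
expected = ["title", "author", "created_at", "page_count"]
extracted = {"title": "Quarterly Report", "author": "", "page_count": 3}

coverage = metadata_coverage(extracted, expected)
print(f"{coverage:.0f}% of expected fields extracted")
```

Averaging this per-document value across all documents a framework processed would give a single per-framework coverage score.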
Field Analysis: Compares specific metadata field extraction capabilities across frameworks, highlighting strengths and gaps.
Quality Metrics: Combines extraction completeness, text coherence, semantic similarity, and document-specific quality checks.
Text Quality: Measures how well frameworks preserve readable, coherent text structure during extraction.
Use the --enable-quality-assessment flag to generate comprehensive quality metrics and visualizations.
Framework Details
Kreuzberg
License: MIT | Version: 3.8.1 | Size: 71MB base
Fast Python document intelligence with multiple OCR backends. Supports both sync and async APIs.
Strengths: Speed, small footprint, async support, comprehensive format coverage
Format Support: All tested formats except MSG (no open source support)
Commercial Use: ✅ Fully permissive MIT license
Docling
License: MIT | Version: 2.41.0 | Size: 1GB+
IBM Research's advanced document understanding with ML models.
Strengths: Advanced ML understanding, high quality
Format Support: PDF, DOCX, XLSX, PPTX, HTML, CSV, MD, AsciiDoc, Images
Commercial Use: ✅ Fully permissive MIT license
Markitdown
License: MIT | Version: 0.0.1a2 | Size: 251MB
Microsoft's lightweight Markdown converter optimized for LLM processing.
Strengths: LLM-optimized output, ONNX performance
Limitations: Limited format support
Commercial Use: ✅ Fully permissive MIT license
Unstructured
License: Apache 2.0 | Version: 0.18.5 | Size: 146MB
Enterprise solution supporting 64+ file types.
Strengths: Widest format support, enterprise features
Limitations: Moderate speed
Commercial Use: ✅ Permissive Apache 2.0 license
Extractous
License: Apache 2.0 | Version: 0.1.0 | Size: ~100MB
Fast Rust-based extraction with Python bindings.
Strengths: Native performance, low memory usage
Format Support: Common office and web formats
Commercial Use: ✅ Permissive Apache 2.0 license
🌐 HTML Report | 📝 Markdown Report | 📊 JSON Metrics | 📊 Summary Data
Additional analysis modules are available in the detailed reports section above.
Specialized analysis of table detection and extraction capabilities across frameworks, focusing on structure preservation, cell accuracy, and formatting retention.
Use the --table-extraction-only flag to focus analysis on documents containing tables.
Detailed per-file-type performance data is available in the benchmark reports above.