Quality Assessment

Overview

Quality assessment evaluates the completeness and accuracy of extracted text. It is enabled by default in all benchmarks.

Assessment Criteria

Text Completeness

  • Full Text Extraction: All visible text is extracted
  • Hidden Text: Extraction of comments, annotations, and metadata
  • Special Characters: Proper handling of Unicode characters and symbols
  • Language Support: Multi-language text extraction
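Completeness criteria like these can be checked mechanically. The sketch below is illustrative only (the function name and flag names are assumptions, not the benchmark's actual API); it flags empty output, decoding failures, and loss of non-ASCII text:

```python
def completeness_flags(text: str) -> dict:
    """Heuristic completeness checks on extracted text (illustrative)."""
    return {
        # No visible text was extracted at all
        "empty": not text.strip(),
        # U+FFFD usually signals a decoding failure upstream
        "has_replacement_chars": "\ufffd" in text,
        # Surviving non-ASCII suggests Unicode was not stripped
        "retains_non_ascii": any(ord(c) > 127 for c in text),
    }

flags = completeness_flags("Grüße from a naïve café")
```

A real implementation would also diff against the source document; these flags only catch gross failures.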

Structure Preservation

  • Paragraph Boundaries: Maintaining text flow
  • List Formatting: Preserving bullet points and numbering
  • Table Structure: Extracting tabular data correctly
  • Header/Footer: Identifying document sections
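Structure preservation can likewise be probed with cheap signals. This is a minimal sketch under assumed conventions (blank lines mark paragraph boundaries; `-`, `*`, `•`, or a leading number marks a list item), not the benchmark's actual checks:

```python
def structure_flags(text: str) -> dict:
    """Rough structure-preservation signals (illustrative heuristics)."""
    lines = text.splitlines()
    return {
        # Blank lines indicate paragraph boundaries survived extraction
        "has_paragraph_breaks": "\n\n" in text,
        # Bullet markers or "1."-style numbering at line starts
        "has_list_markers": any(
            ln.lstrip().startswith(("-", "*", "•"))
            or ln.lstrip()[:2].rstrip(".").isdigit()
            for ln in lines
        ),
    }

sample = "Intro paragraph.\n\n- first item\n- second item\n"
```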

Metadata Extraction

  • Document Properties: Title, author, creation date
  • Format-Specific Metadata: PDF info, EXIF data
  • Embedded Resources: References to embedded images and attachments
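One simple way to score metadata extraction is coverage of expected document properties. The property names below are hypothetical placeholders, not the benchmark's real schema:

```python
# Hypothetical property names; real schemas vary by format (PDF info, EXIF, ...)
EXPECTED_KEYS = ("title", "author", "created")

def metadata_coverage(meta: dict) -> float:
    """Fraction of expected document properties that were extracted."""
    present = sum(1 for key in EXPECTED_KEYS if meta.get(key))
    return present / len(EXPECTED_KEYS)

metadata_coverage({"title": "Report", "author": "A. Smith", "created": None})
```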

Scoring Algorithm

Quality scores are calculated using:

  1. Reference Comparison: When available, compare against known good extraction
  2. Heuristic Analysis: Check for common extraction issues
  3. Format Validation: Ensure output matches expected format
  4. Content Verification: Validate that the extracted content is coherent

Quality Grades

  • Excellent (90-100): Near-perfect extraction
  • Good (80-89): Minor issues, usable output
  • Fair (70-79): Some problems, mostly usable
  • Poor (60-69): Significant issues, limited use
  • Failed (<60): Unusable extraction
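The grade bands above map directly onto a threshold function; a minimal sketch (the function name is an assumption):

```python
def grade(score: float) -> str:
    """Map a 0-100 quality score to the grade bands above."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 70:
        return "Fair"
    if score >= 60:
        return "Poor"
    return "Failed"
```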

Framework Comparisons

Quality varies by:

  • File Format: Some frameworks excel at specific formats
  • File Complexity: Performance degrades with complexity
  • OCR Requirements: Image-based text affects quality
  • Language: Non-English text may impact scores