PDF OCR Service - Enhanced with Comprehensive Indentation Detection & Intelligent Text Classification
Convert PDF documents to text using OCR with intermediate HTML processing, smart table handling, and indentation pattern recognition, including parenthetical markers such as (1), (๑), and (a). Extracted text is classified as headers, paragraphs, or list items.
Instructions & Enhanced Features
How to Use:
- Upload PDF: Select your PDF file in the configuration panel below
- Choose Method: Select OCR method (Auto recommended for best results)
- Configure Crop (Optional): Enable header/footer removal and adjust crop settings
- Process: Click the process button to extract text with comprehensive indentation detection and text classification
- Download: Get results in TXT, DOCX, or HTML format with preserved formatting
Comprehensive Indentation Detection & Text Classification Features:
- Decimal: 1.1.1.1.1...
- Mixed: 1.2.a.i.A...
- Legal: 1.1.1(a)(i)
- Outline: I.A.1.a.i.
- Section: §1.2.3, Article 1.1.1
- Parenthetical Arabic Numerals: (1), (2), (3)
- Parenthetical Thai Numerals: (๑), (๒), (๓)
- Parenthetical Letters: (a), (b), (A), (B)
- Parenthetical Roman Numerals: (i), (ii), (I), (II)
- Parenthetical Thai Letters: (ก), (ข), (ค)
- Thai Script: มาตรา, ข้อ, ก.ข.ค.
- Bullets: •◦▪→ and 20+ more
- Roman Numerals: I.II.III, i.ii.iii
- Letter Lists: A.B.C, a.b.c
- Checkboxes: [x], [ ], [✓]
- Header Detection: Title case, all caps, short lines
- Paragraph Recognition: Long text, proper punctuation
- List Item Identification: Patterned content
- Context Analysis: Position, font size, formatting
- Confidence Scoring: Reliability assessment
Technical Enhancements:
- Smart Table Detection: 70% overlap threshold prevents text loss
- HTML Processing: Better structure and formatting preservation
- Multi-format Export: TXT, DOCX, and HTML downloads with preserved indentation
- Advanced Crop Control: Per-page customization with real-time preview
- Enhanced Resolution: High-quality processing for better accuracy
- Document Analysis: Automatic structure detection and statistics
- Priority Pattern Matching: Intelligent pattern detection with priority ranking
- Text Classification: Automated header, paragraph, and list item detection
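One plausible reading of the 70% overlap threshold is that a text block is assigned to a table only when at least 70% of its area lies inside the table's bounding box, so text that merely touches a table region stays in the regular text flow. The sketch below illustrates that reading only; the `Box` type and both function names are hypothetical, and the service's real rule may differ.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes have zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(text: Box, table: Box) -> float:
    """Fraction of the text box's area that lies inside the table box."""
    inter = Box(
        max(text.x0, table.x0),
        max(text.y0, table.y0),
        min(text.x1, table.x1),
        min(text.y1, table.y1),
    )
    return inter.area / text.area if text.area > 0 else 0.0


def belongs_to_table(text: Box, table: Box, threshold: float = 0.70) -> bool:
    """Assumed rule: treat the text as table content at or above the threshold."""
    return overlap_ratio(text, table) >= threshold
```

Measuring the overlap relative to the text block, not the table, keeps the decision independent of table size: a small caption half inside a large table is still kept as ordinary text.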
Configuration Panel
Choose OCR method (all enhanced with comprehensive indentation detection and text classification)
Auto Selection: Automatically chooses the best available method. The selected method runs with indentation detection, text classification, HTML processing, pattern recognition for hierarchical numbering (including parenthetical patterns such as (1), (๑), and (a)) and bullets, and multi-language support.
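Auto selection can be pictured as walking a preference list and taking the first method that is available. This is a minimal sketch under stated assumptions: the method keys, the preference order in `DEFAULT_ORDER`, and the function `choose_method` are hypothetical and are not taken from the service's code.

```python
from typing import Mapping, Sequence

# Assumed preference order; the service's real ranking is not documented here.
DEFAULT_ORDER = ("azure_document_intelligence", "tesseract", "pymupdf")


def choose_method(
    available: Mapping[str, bool],
    order: Sequence[str] = DEFAULT_ORDER,
) -> str:
    """Return the first method in `order` that is marked available."""
    for name in order:
        if available.get(name, False):
            return name
    raise RuntimeError("No OCR method is available")


# Example availability map: first and third methods ready, second missing.
status = {
    "azure_document_intelligence": True,
    "tesseract": False,
    "pymupdf": True,
}
selected = choose_method(status)
```

Passing availability in as a mapping keeps the selection logic separate from how each backend is probed, which makes the fallback order easy to test.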
Header/Footer Removal & Crop Settings
Remove headers and footers with high-resolution processing
Results & Downloads
Service Status & Capabilities
Available OCR Methods (Enhanced with Comprehensive Indentation Detection & Text Classification):
- ✅ Azure Document Intelligence - Ready (HTML + Tables + Comprehensive Indentation + Text Classification)
- ❌ Tesseract OCR - Not available
- ✅ PyMuPDF - Ready (HTML Enhanced + Comprehensive Indentation + Text Classification)
Comprehensive Indentation Detection Features:
- ✅ Hierarchical Decimal Numbering (1.1.1.1.1...)
- ✅ Mixed Hierarchical Numbering (1.2.a.i.A...)
- ✅ Legal Numbering (1.1.1(a)(i))
- ✅ Outline Numbering (I.A.1.a.i.)
- ✅ Section Numbering (§1.2.3, Article 1.1.1)
- ✅ Parenthetical Arabic Numerals ((1), (2), (3))
- ✅ Parenthetical Thai Numerals ((๑), (๒), (๓))
- ✅ Parenthetical Letters ((a), (b), (A), (B))
- ✅ Parenthetical Roman Numerals ((i), (ii), (I), (II))
- ✅ Parenthetical Thai Letters ((ก), (ข), (ค))
- ✅ Thai Script Support (มาตรา, ข้อ, ก.ข.ค.)
- ✅ Multiple Bullet Styles (•◦▪→ and more)
- ✅ Checkbox Items ([x], [ ], [✓])
- ✅ Roman Numerals (I.II.III, i.ii.iii)
- ✅ Letter Lists (A.B.C, a.b.c)
- ✅ Space-based Indentation Detection
- ✅ Priority-based Pattern Matching
Intelligent Text Classification Features:
- ✅ Header Detection (title case, all caps, short lines)
- ✅ Paragraph Classification (long text, proper punctuation)
- ✅ List Item Recognition (patterned content)
- ✅ Context-aware Analysis (position, font size)
- ✅ Confidence Scoring
- ✅ Document Structure Analysis
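The classification cues named above (title case, all caps, short lines, ending punctuation, patterned content) can be combined into a simple scored heuristic. The sketch below is illustrative only: `classify_line`, `LIST_MARKER`, the 60-character cutoff, and every weight are assumptions, and it ignores position and font size, which are not available from plain text.

```python
import re
from typing import Tuple

# Simplified list-marker pattern (assumption): a bullet, a short "(x)" marker,
# or dotted numbering such as "1." or "1.2", followed by whitespace.
LIST_MARKER = re.compile(r"^\s*(?:[•◦▪→]|\([^\s()]{1,4}\)|(?:\d+\.)+\d*)\s+")


def classify_line(text: str, short_line_chars: int = 60) -> Tuple[str, float]:
    """Return (label, confidence); label is "list_item", "header", or "paragraph"."""
    stripped = text.strip()
    if not stripped:
        return ("paragraph", 0.0)
    if LIST_MARKER.match(text):
        return ("list_item", 0.9)

    # Accumulate evidence that the line is a header.
    score = 0.0
    if len(stripped) <= short_line_chars:
        score += 0.4  # short line
    if stripped.isupper() or stripped.istitle():
        score += 0.4  # all caps or title case (no effect on caseless scripts)
    if not stripped.endswith((".", "!", "?", ";", ",")):
        score += 0.2  # no sentence-style ending punctuation

    if score >= 0.6:
        return ("header", round(score, 2))
    return ("paragraph", round(1.0 - score, 2))
```

Checking for a list marker first mirrors the idea that patterned content takes precedence over case and length cues, so a short line such as `(1) Scope` is reported as a list item and not as a header.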
Enhanced Processing Features:
- ✅ HTML Processing - Available
- ✅ Enhanced Table Handling - Available
- ✅ Smart Text Preservation - Available
- ✅ Multi-Page Crop Preview - Available
- ✅ Per-Page Crop Customization - Available
- ✅ Document Structure Analysis - Available
- ✅ Enhanced DOCX Export - Available (with indentation formatting)
- ✅ HTML File Export - Available
- ✅ Enhanced Text Export - Available
- ✅ Pattern Detection Engine - 26 patterns supported