PDF OCR Service - Enhanced with Comprehensive Indentation Detection & Intelligent Text Classification

Convert PDF documents to text using enhanced OCR with HTML intermediate processing, smart table handling, and comprehensive indentation pattern recognition, including parenthetical patterns such as (1), (๑), and (a). Intelligent text classification distinguishes headers, paragraphs, and list items.

Instructions & Enhanced Features

How to Use:

  1. Upload PDF: Select your PDF file in the configuration panel below
  2. Choose Method: Select OCR method (Auto recommended for best results)
  3. Configure Crop (Optional): Enable header/footer removal and adjust crop settings
  4. Process: Click the process button to extract text with comprehensive indentation detection and text classification
  5. Download: Get results in TXT, DOCX, or HTML format with preserved formatting

Comprehensive Indentation Detection & Text Classification Features:

Hierarchical Numbering:
  • Decimal: 1.1.1.1.1...
  • Mixed: 1.2.a.i.A...
  • Legal: 1.1.1(a)(i)
  • Outline: I.A.1.a.i.
  • Section: §1.2.3, Article 1.1.1
Parenthetical Patterns:
  • Arabic: (1), (2), (3)
  • Thai Numerals: (๑), (๒), (๓)
  • Letters: (a), (b), (A), (B)
  • Roman: (i), (ii), (I), (II)
  • Thai Letters: (ก), (ข), (ค)
Multi-Language & Symbols:
  • Thai Script: มาตรา, ข้อ, ก.ข.ค.
  • Bullets: •◦▪→ and 20+ more
  • Roman: I.II.III, i.ii.iii
  • Letters: A.B.C, a.b.c
  • Checkboxes: [x], [ ], [✓]
Intelligent Text Classification:
  • Header Detection: Title case, all caps, short lines
  • Paragraph Recognition: Long text, proper punctuation
  • List Item Identification: Patterned content
  • Context Analysis: Position, font size, formatting
  • Confidence Scoring: Reliability assessment
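
The pattern families and classification cues listed above can be illustrated with a small sketch. This is not the service's actual engine: the pattern names, the priority order, the word-count thresholds, and the confidence values are all assumptions chosen for illustration, and only a handful of the listed patterns are covered.

```python
import re

# Hypothetical priority-ordered patterns: the first match wins.
# ASCII digits are spelled [0-9] because \d would also match Thai digits.
PATTERNS = [
    ("legal", re.compile(r"^[0-9]+(?:\.[0-9]+)+(?:\([a-z]+\))+")),
    ("decimal", re.compile(r"^[0-9]+(?:\.[0-9]+)+\.?")),
    ("paren_thai_numeral", re.compile(r"^\([\u0E50-\u0E59]+\)")),
    ("paren_arabic", re.compile(r"^\([0-9]+\)")),
    # Single letters i, v, x are ambiguous; this sketch treats them as Roman.
    ("paren_roman", re.compile(r"^\((?:[ivx]+|[IVX]+)\)")),
    ("paren_thai_letter", re.compile(r"^\([\u0E01-\u0E2E]\)")),
    ("paren_letter", re.compile(r"^\([A-Za-z]\)")),
    ("checkbox", re.compile(r"^\[[ xX✓]\]")),
    ("bullet", re.compile(r"^[•◦▪→]")),
]


def detect_marker(line):
    """Return (pattern_name, matched_marker, leading_space_count)."""
    stripped = line.lstrip()
    indent = len(line) - len(stripped)
    for name, pattern in PATTERNS:
        match = pattern.match(stripped)
        if match:
            return name, match.group(0), indent
    return None, "", indent


def classify_line(line):
    """Return (label, confidence) using simple, assumed heuristics."""
    text = line.strip()
    if not text:
        return "blank", 1.0
    name, _, _ = detect_marker(line)
    if name is not None:
        return "list_item", 0.9
    words = text.split()
    ends_sentence = text[-1] in ".!?:;"
    # Short, unpunctuated, title-case or all-caps lines look like headers.
    if len(words) <= 8 and not ends_sentence and (text.isupper() or text.istitle()):
        return "header", 0.8
    # Long, punctuated lines look like paragraphs.
    if len(words) >= 12 and ends_sentence:
        return "paragraph", 0.8
    # Everything else defaults to paragraph with low confidence.
    return "paragraph", 0.5
```

Ordering the more specific patterns first (legal before decimal, Thai numerals before Arabic) is what stands in for priority ranking here. A real engine would also need context such as font size and position, which plain text lines do not carry.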

Technical Enhancements:

  • Smart Table Detection: 70% overlap threshold prevents text loss
  • HTML Processing: Better structure and formatting preservation
  • Multi-format Export: TXT, DOCX, and HTML downloads with preserved indentation
  • Advanced Crop Control: Per-page customization with real-time preview
  • Enhanced Resolution: High-quality processing for better accuracy
  • Document Analysis: Automatic structure detection and statistics
  • Priority Pattern Matching: Intelligent pattern detection with priority ranking
  • Text Classification: Automated header, paragraph, and list item detection
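
One plausible reading of the 70% overlap threshold is sketched below: a text block is assigned to a table only when at least 70% of the block's own area lies inside the table's bounding box, so text that merely touches a table stays in the normal text flow. The function names and the (x0, y0, x1, y1) box layout are assumptions, not the service's documented behavior.

```python
def overlap_ratio(text_box, table_box):
    """Fraction of text_box's area that lies inside table_box.

    Boxes are (x0, y0, x1, y1) tuples with x0 < x1 and y0 < y1.
    """
    tx0, ty0, tx1, ty1 = text_box
    bx0, by0, bx1, by1 = table_box
    width = min(tx1, bx1) - max(tx0, bx0)
    height = min(ty1, by1) - max(ty0, by0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    text_area = (tx1 - tx0) * (ty1 - ty0)
    if text_area <= 0:
        return 0.0  # degenerate text box
    return (width * height) / text_area


def belongs_to_table(text_box, table_box, threshold=0.7):
    """True when the text block overlaps the table by at least the threshold."""
    return overlap_ratio(text_box, table_box) >= threshold
```

For example, a 10 x 10 text block with 80% of its area inside a table would be treated as table content, while one that is only half covered would be kept as regular text.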

Configuration Panel

OCR Method

Choose OCR method (all enhanced with comprehensive indentation detection and text classification)

Auto Selection: Automatically chooses the best available method. Every method includes comprehensive indentation detection, intelligent text classification, HTML processing, and multi-language support, with enhanced pattern recognition for bullets and hierarchical numbering (including parenthetical patterns such as (1), (๑), and (a)).
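
Auto selection of this kind is often implemented as a walk over a preference list. The sketch below assumes a preference order and method keys (`azure`, `tesseract`, `pymupdf`) that are hypothetical; the page does not state how the service actually ranks its methods.

```python
# Hypothetical preference order; the real ranking is not documented here.
PREFERENCE = ["azure", "tesseract", "pymupdf"]


def choose_method(available, requested="auto"):
    """Pick an OCR method from a dict such as {"azure": True, ...}."""
    if requested != "auto":
        # An explicit choice must be available, otherwise fail loudly.
        if not available.get(requested, False):
            raise ValueError(f"OCR method not available: {requested}")
        return requested
    # Auto: return the first available method in preference order.
    for name in PREFERENCE:
        if available.get(name, False):
            return name
    raise RuntimeError("no OCR method is available")
```

With a status of `{"azure": True, "tesseract": False, "pymupdf": True}`, this sketch would return `azure`, and it would fall back to `pymupdf` if Azure were unavailable.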

Header/Footer Removal & Crop Settings

Remove headers and footers with high-resolution processing
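
As a rough illustration of how header/footer crop settings with per-page overrides might translate into a crop rectangle, the sketch below works in percentages of page height. The parameter names, the percentage units, and the override dictionary are assumptions for illustration only.

```python
def crop_box(width, height, top_pct=0.0, bottom_pct=0.0):
    """Return (x0, y0, x1, y1) after trimming top and bottom percentages."""
    if top_pct < 0 or bottom_pct < 0 or top_pct + bottom_pct >= 100:
        raise ValueError("crop percentages must be >= 0 and sum to < 100")
    y0 = height * top_pct / 100            # skip the header band
    y1 = height - height * bottom_pct / 100  # stop before the footer band
    return (0.0, float(y0), float(width), float(y1))


def crop_for_page(page_no, width, height, default, overrides):
    """Use a per-page (top_pct, bottom_pct) override when one exists."""
    top_pct, bottom_pct = overrides.get(page_no, default)
    return crop_box(width, height, top_pct, bottom_pct)
```

Here `crop_box(600, 800, 10, 5)` keeps the band between y = 80 and y = 760, and `crop_for_page` lets a single page (for example, a cover page with a taller header) use different settings from the rest of the document.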

Results & Downloads

Service Status & Capabilities

Available OCR Methods (Enhanced with Comprehensive Indentation Detection & Text Classification):
  ✅ Azure Document Intelligence - Ready (HTML + Tables + Comprehensive Indentation + Text Classification)
  ❌ Tesseract OCR - Not available
  ✅ PyMuPDF - Ready (HTML Enhanced + Comprehensive Indentation + Text Classification)

Comprehensive Indentation Detection Features:
  ✅ Hierarchical Decimal Numbering (1.1.1.1.1...)
  ✅ Mixed Hierarchical Numbering (1.2.a.i.A...)
  ✅ Legal Numbering (1.1.1(a)(i))
  ✅ Outline Numbering (I.A.1.a.i.)
  ✅ Section Numbering (§1.2.3, Article 1.1.1)
  ✅ Parenthetical Arabic Numerals: (1), (2), (3)
  ✅ Parenthetical Thai Numerals: (๑), (๒), (๓)
  ✅ Parenthetical Letters: (a), (b), (A), (B)
  ✅ Parenthetical Roman Numerals: (i), (ii), (I), (II)
  ✅ Parenthetical Thai Letters: (ก), (ข), (ค)
  ✅ Thai Script Support (มาตรา, ข้อ, ก.ข.ค.)
  ✅ Multiple Bullet Styles (•◦▪→ and more)
  ✅ Checkbox Items ([x], [ ], [✓])
  ✅ Roman Numerals (I.II.III, i.ii.iii)
  ✅ Letter Lists (A.B.C, a.b.c)
  ✅ Space-based Indentation Detection
  ✅ Priority-based Pattern Matching

Intelligent Text Classification Features:
  ✅ Header Detection (title case, all caps, short lines)
  ✅ Paragraph Classification (long text, proper punctuation)
  ✅ List Item Recognition (patterned content)
  ✅ Context-aware Analysis (position, font size)
  ✅ Confidence Scoring
  ✅ Document Structure Analysis

Enhanced Processing Features:
  ✅ HTML Processing - Available
  ✅ Enhanced Table Handling - Available
  ✅ Smart Text Preservation - Available
  ✅ Multi-Page Crop Preview - Available
  ✅ Per-Page Crop Customization - Available
  ✅ Document Structure Analysis - Available
  ✅ Enhanced DOCX Export - Available (with indentation formatting)
  ✅ HTML File Export - Available
  ✅ Enhanced Text Export - Available
  ✅ Pattern Detection Engine - 26 patterns supported