PDF OCR Service - Enhanced with OpenCV Text Block Analysis & Bold Detection

Convert PDF documents to text using OpenCV-enhanced OCR. The service provides text block detection, bold text recognition, HTML intermediate processing, smart table handling, comprehensive indentation pattern recognition (including parenthetical patterns such as (1), (๑), and (a)), and intelligent classification of text into headers, paragraphs, and list items.

OpenCV Status: ✅ Available - Text block analysis and bold detection enabled

Instructions & OpenCV-Enhanced Features

How to Use:

  1. Upload PDF: Select your PDF file in the configuration panel below
  2. Choose Method: Select OCR method (Auto recommended for best results)
  3. Configure Crop (Optional): Enable header/footer removal and adjust crop settings with OpenCV visualization
  4. Process: Click the process button to extract text with OpenCV text block analysis and bold detection
  5. Download: Get results in TXT, DOCX, or HTML format with preserved formatting and header detection

OpenCV Text Block Analysis & Bold Detection Features:

OpenCV Enhancements ✅:
  • Text Block Detection & Analysis
  • Entire Line Bold Text Recognition for Headers
  • Automatic Spacing & Paragraph Detection
  • Visual Text Element Analysis
  • Header Indentation Suppression
  • Real-time Text Block Overlay
  • 4-Space Indentation System
Hierarchical Numbering:
  • Decimal: 1.1.1.1.1...
  • Mixed: 1.2.a.i.A...
  • Legal: 1.1.1(a)(i)
  • Outline: I.A.1.a.i.
  • Section: §1.2.3, Article 1.1.1
Parenthetical Patterns:
  • Arabic: (1), (2), (3)
  • Thai Numerals: (๑), (๒), (๓)
  • Letters: (a), (b), (A), (B)
  • Roman: (i), (ii), (I), (II)
  • Thai Letters: (ก), (ข), (ค)
Multi-Language & Symbols:
  • Thai Script: มาตรา, ข้อ, ก.ข.ค.
  • Bullets: •◦▪→ and 20+ more
  • Roman: I.II.III, i.ii.iii
  • Letters: A.B.C, a.b.c
  • Checkboxes: [x], [ ], [✓]
Intelligent Text Classification:
  • Header Detection: Title case, all caps, short lines
  • Paragraph Recognition: Long text, proper punctuation
  • List Item Identification: Patterned content
  • Context Analysis: Position, font size, formatting
  • Confidence Scoring: Reliability assessment
  • OpenCV Bold Detection: Enabled
Technical Enhancements:
  • OpenCV Text Block Analysis: Enabled
  • Bold Text Recognition: Enabled
  • Header Indentation Suppression: Enabled
  • Smart Table Detection: 70% overlap threshold prevents text loss
  • HTML Processing: Better structure and formatting preservation
  • Multi-format Export: TXT, DOCX, and HTML downloads with preserved indentation
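
One way to read the 70% overlap threshold above: a text block is assigned to a table only when at least 70% of its area lies inside the table region, so text that merely touches a table is kept as regular text. The sketch below illustrates that reading. The box layout (x0, y0, x1, y1), the function names, and the exact rule are assumptions for illustration, not the service's actual code.

```python
from typing import Tuple

# Hypothetical box layout: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def overlap_ratio(text: Box, table: Box) -> float:
    """Return the fraction of the text box's area that lies inside the table box."""
    ix0, iy0 = max(text[0], table[0]), max(text[1], table[1])
    ix1, iy1 = min(text[2], table[2]), min(text[3], table[3])
    # Width and height of the intersection, clamped to zero when boxes are disjoint.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (text[2] - text[0]) * (text[3] - text[1])
    return inter / area if area > 0 else 0.0


def belongs_to_table(text: Box, table: Box, threshold: float = 0.70) -> bool:
    """Treat the text block as table content only at or above the threshold."""
    return overlap_ratio(text, table) >= threshold
```

With this rule, a block that is only half covered by a table stays in the regular text flow instead of being dropped along with the table region.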

Configuration Panel

OCR Method

Choose an OCR method. All methods are enhanced with OpenCV ✅, comprehensive indentation detection, and text classification.

Auto Selection: Automatically chooses the best available method. The selected method runs with OpenCV text block analysis and bold detection, comprehensive indentation detection, intelligent text classification, and HTML processing. It also applies enhanced pattern recognition for hierarchical numbering (including parenthetical patterns such as (1), (๑), and (a)) and bullets, multi-language support, and header indentation suppression.

Header/Footer Removal & Crop Settings with OpenCV Enhancement

Remove headers and footers with high-resolution processing + OpenCV text block analysis
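
Header/footer removal can be thought of as trimming a band from the top and bottom of each page and keeping only the text blocks that fall inside the remaining region. The sketch below shows that idea under stated assumptions: crop amounts are given as percentages of page height, the origin is the top-left corner, and a block is kept when its vertical center lies inside the kept region. The names `crop_box` and `keep_block` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def crop_box(width: float, height: float,
             top_pct: float = 0.0, bottom_pct: float = 0.0) -> Box:
    """Return the page region kept after trimming the header and footer bands."""
    if top_pct < 0 or bottom_pct < 0 or top_pct + bottom_pct >= 100:
        raise ValueError("crop percentages must be >= 0 and sum to less than 100")
    y0 = height * top_pct / 100.0
    y1 = height - height * bottom_pct / 100.0
    return (0.0, y0, width, y1)


def keep_block(block: Box, kept: Box) -> bool:
    """Keep a text block when its vertical center lies inside the kept region."""
    center_y = (block[1] + block[3]) / 2.0
    return kept[1] <= center_y <= kept[3]
```

For example, `crop_box(600, 800, top_pct=10, bottom_pct=5)` keeps the band between y = 80 and y = 760, so a page-number line near the bottom edge would be excluded.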

Results & Downloads

Service Status & OpenCV-Enhanced Capabilities

Available OCR Methods (Enhanced with OpenCV Text Block Analysis & Bold Detection):
  ✅ Azure Document Intelligence - Ready (HTML + Tables + Comprehensive Indentation + Text Classification + OpenCV Enhanced)
  ❌ Tesseract OCR - Not available
  ✅ PyMuPDF - Ready (HTML Enhanced + Comprehensive Indentation + Text Classification + OpenCV Enhanced)

OpenCV Text Block Analysis & Bold Detection Features:
  ✅ Text Block Detection & Analysis
  ✅ Bold Text Recognition for Headers
  ✅ Automatic Spacing & Paragraph Detection
  ✅ Visual Text Element Analysis
  ✅ Header Indentation Suppression
  ✅ Real-time Crop Preview with Text Overlay
  ✅ Enhanced High-Resolution Processing
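
Bold detection for whole lines is commonly approximated by comparing stroke thickness between lines: bold text produces wider runs of ink pixels. The sketch below is a pure-Python stand-in for that idea, not the service's OpenCV implementation. It assumes each line has already been binarized into rows of 0/1 values (1 = ink), for instance by an OpenCV thresholding step, and the 1.5x factor is an arbitrary illustrative choice.

```python
from statistics import median
from typing import List, Sequence

LineImage = Sequence[Sequence[int]]  # rows of 0/1 pixels, 1 = ink (assumed input)


def mean_stroke_width(rows: LineImage) -> float:
    """Average length of horizontal runs of ink pixels in one text line."""
    runs: List[int] = []
    for row in rows:
        run = 0
        for px in row:
            if px:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:  # a run that reaches the right edge of the row
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0


def bold_lines(line_images: Sequence[LineImage], factor: float = 1.5) -> List[bool]:
    """Flag lines whose stroke width is well above the page's median stroke width."""
    widths = [mean_stroke_width(img) for img in line_images]
    inked = [w for w in widths if w > 0]
    if not inked:
        return [False] * len(widths)
    baseline = median(inked)
    return [w >= factor * baseline for w in widths]
```

Comparing against the page median rather than a fixed pixel width keeps the heuristic independent of scan resolution, although it would miss bold text on a page where most lines are bold.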

Comprehensive Indentation Detection Features:
  ✅ Hierarchical Decimal Numbering: 1.1.1.1.1...
  ✅ Mixed Hierarchical Numbering: 1.2.a.i.A...
  ✅ Legal Numbering: 1.1.1(a)(i)
  ✅ Outline Numbering: I.A.1.a.i.
  ✅ Section Numbering: §1.2.3, Article 1.1.1
  ✅ Parenthetical Arabic Numerals: (1), (2), (3)
  ✅ Parenthetical Thai Numerals: (๑), (๒), (๓)
  ✅ Parenthetical Letters: (a), (b), (A), (B)
  ✅ Parenthetical Roman Numerals: (i), (ii), (I), (II)
  ✅ Parenthetical Thai Letters: (ก), (ข), (ค)
  ✅ Thai Script Support: มาตรา, ข้อ, ก.ข.ค.
  ✅ Multiple Bullet Styles: •◦▪→ and more
  ✅ Checkbox Items: [x], [ ], [✓]
  ✅ Roman Numerals: I.II.III, i.ii.iii
  ✅ Letter Lists: A.B.C, a.b.c
  ✅ Space-based Indentation Detection
  ✅ Priority-based Pattern Matching
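
Priority-based pattern matching can be sketched as an ordered list of regular expressions in which the first match wins, so more specific markers such as legal numbering are tested before the general ones they overlap with. The example below covers only a handful of the patterns listed above. The pattern names, their order, and the helper functions are assumptions for illustration and do not reproduce the service's pattern engine.

```python
import re
from typing import Optional, Tuple

# Ordered from most to least specific; the first pattern that matches wins.
# Single-letter markers such as (i) are ambiguous between letters and Roman
# numerals; this sketch treats them as letters.
PATTERNS = [
    ("legal", re.compile(r"^[0-9]+(?:\.[0-9]+)*(?:\([a-z]+\))+")),
    ("decimal", re.compile(r"^[0-9]+(?:\.[0-9]+)+\.?")),
    ("paren_thai_numeral", re.compile(r"^\([๐-๙]+\)")),
    ("paren_thai_letter", re.compile(r"^\([ก-ฮ]\)")),
    ("paren_arabic", re.compile(r"^\([0-9]+\)")),
    ("paren_letter", re.compile(r"^\([A-Za-z]\)")),
    ("paren_roman", re.compile(r"^\((?:[ivxlcdm]+|[IVXLCDM]+)\)")),
    ("checkbox", re.compile(r"^\[[ xX✓]\]")),
    ("bullet", re.compile(r"^[•◦▪→]")),
]


def classify_marker(line: str) -> Optional[Tuple[str, str]]:
    """Return (pattern name, matched marker) for the first matching pattern."""
    text = line.lstrip()
    for name, rx in PATTERNS:
        m = rx.match(text)
        if m:
            return name, m.group(0)
    return None


def indent(line: str, level: int, width: int = 4) -> str:
    """Re-indent a line to the given level using a fixed number of spaces."""
    return " " * (width * level) + line.lstrip()
```

For instance, `classify_marker("1.1.1(a)(i) Scope")` is matched by the legal pattern before the decimal pattern gets a chance to claim the `1.1.1` prefix, which is the point of ordering by specificity.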

Intelligent Text Classification Features:
  ✅ Header Detection (title case, all caps, short lines)
  ✅ Paragraph Classification (long text, proper punctuation)
  ✅ List Item Recognition (patterned content)
  ✅ Context-aware Analysis (position, font size)
  ✅ Confidence Scoring
  ✅ Document Structure Analysis
  ✅ OpenCV-Enhanced Bold Text Detection
  ✅ Header Indentation Suppression
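
The classification cues listed above (short lines, title case or all caps, terminal punctuation, bold) can be combined into a simple score that doubles as a confidence value. The sketch below is a rough illustration of that approach. The point weights, the 8-word limit, the 0.6 header cutoff, and the fixed 0.9 confidence for list items are all arbitrary assumptions, not values taken from the service.

```python
import re
from typing import Tuple

# Simplified list-marker check: parenthetical markers, dotted numbers,
# a few bullets, and checkboxes, each followed by whitespace.
LIST_MARKER = re.compile(
    r"^\s*(?:\([^()\s]{1,4}\)|[0-9]+(?:\.[0-9]+)*[.)]|[•◦▪→]|\[[ xX✓]\])\s+"
)


def classify_line(line: str, is_bold: bool = False) -> Tuple[str, float]:
    """Return (label, confidence) where label is header, paragraph, list_item, or blank."""
    text = line.strip()
    if not text:
        return "blank", 1.0
    if LIST_MARKER.match(line):
        return "list_item", 0.9
    points = 0  # header evidence, out of 10
    if len(text.split()) <= 8:
        points += 3  # short line
    if text.isupper() or text.istitle():
        points += 3  # all caps or title case
    if not text.endswith((".", "!", "?", ":", ";", ",")):
        points += 2  # no sentence-style ending
    if is_bold:
        points += 2  # e.g. flagged by a separate bold-detection step
    if points >= 6:
        return "header", points / 10
    return "paragraph", (10 - points) / 10
```

A header classified this way would then skip indentation, which matches the header indentation suppression feature, while list items keep the indentation level derived from their marker.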

Enhanced Processing Features:
  ✅ HTML Processing - Available
  ✅ Enhanced Table Handling - Available
  ✅ Smart Text Preservation - Available
  ✅ Multi-Page Crop Preview - Available
  ✅ Per-Page Crop Customization - Available
  ✅ Document Structure Analysis - Available
  ✅ OpenCV Text Block Overlay - Available
  ✅ Bold Text Visualization - Available
  ✅ Enhanced DOCX Export - Available (with OpenCV-enhanced indentation formatting)
  ✅ HTML File Export - Available
  ✅ Enhanced Text Export - Available
  ✅ Pattern Detection Engine - 26 patterns supported