PDF OCR Service - Enhanced with OpenCV Text Block Analysis & Bold Detection

Convert PDF documents to text using OpenCV-enhanced OCR. The service combines text block detection, bold text recognition, HTML intermediate processing, smart table handling, and indentation pattern recognition (including parenthetical patterns such as (1), (๑), and (a)), and it classifies text as headers, paragraphs, or list items.

OpenCV Status: ✅ Available - Text block analysis and bold detection enabled

Instructions & OpenCV-Enhanced Features

How to Use:

Upload PDF: Select your PDF file in the configuration panel below
Choose Method: Select OCR method (Auto recommended for best results)
Configure Crop (Optional): Enable header/footer removal and adjust crop settings with OpenCV visualization
Process: Click the process button to extract text with OpenCV text block analysis and bold detection
Download: Get results in TXT, DOCX, or HTML format with preserved formatting and header detection

OpenCV Text Block Analysis & Bold Detection Features:

OpenCV Enhancements ✅:

Text Block Detection & Analysis
Entire Line Bold Text Recognition for Headers
Automatic Spacing & Paragraph Detection
Visual Text Element Analysis
Header Indentation Suppression
Real-time Text Block Overlay
4-Space Indentation System

Hierarchical Numbering:

Decimal: 1.1.1.1.1...
Mixed: 1.2.a.i.A...
Legal: 1.1.1(a)(i)
Outline: I.A.1.a.i.
Section: §1.2.3, Article 1.1.1
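
The numbering styles above, combined with the 4-space indentation system, can be sketched as a depth count per marker. This is a minimal illustration, not the service's actual code: the function names `numbering_depth` and `indent_line` are hypothetical, and the rule "one indentation step per numbering component" is an assumption.

```python
import re

def numbering_depth(marker: str) -> int:
    """Count the components of a marker such as '1.2.a.i.A' or '1.1.1(a)(i)'.

    Assumption: a leading keyword ('Article 1.1.1') is not a component,
    so only the last whitespace-separated token is inspected.
    """
    if not marker.strip():
        return 0
    token = marker.split()[-1]
    # Each run of digits or letters is one component; '.', '(', ')', '§' are separators.
    return len(re.findall(r"[0-9]+|[A-Za-z]+", token))

def indent_line(marker: str, text: str, width: int = 4) -> str:
    """Indent a line by `width` spaces per level below the top level."""
    depth = numbering_depth(marker)
    return " " * (width * max(depth - 1, 0)) + f"{marker} {text}"

print(indent_line("1", "General"))
print(indent_line("1.1", "Scope"))
print(indent_line("1.1.1(a)", "Exceptions"))
```

Counting components keeps the decimal, mixed, legal, and outline styles on one code path; a real implementation would likely also need to validate that each component is a plausible numeral.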

Parenthetical Patterns:

Arabic: (1), (2), (3)
Thai Numerals: (๑), (๒), (๓)
Letters: (a), (b), (A), (B)
Roman: (i), (ii), (I), (II)
Thai Letters: (ก), (ข), (ค)
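
As a rough sketch of how the parenthetical patterns above could be recognized, the following uses an ordered list of regular expressions. The names `PAREN_PATTERNS` and `classify_parenthetical` are hypothetical, and the ordering is an assumption: markers such as (i) are both a letter and a Roman numeral, so this sketch simply lets the Roman pattern win.

```python
import re

# Ordered by priority: the first pattern that matches decides the type.
# [0-9] is used instead of \d because \d also matches Thai digits in Python.
PAREN_PATTERNS = [
    ("arabic", re.compile(r"^\s*\(([0-9]+)\)")),
    ("thai_numeral", re.compile(r"^\s*\(([\u0E50-\u0E59]+)\)")),   # ๐ to ๙
    ("roman", re.compile(r"^\s*\(([ivx]+|[IVX]+)\)")),
    ("letter", re.compile(r"^\s*\(([A-Za-z])\)")),
    ("thai_letter", re.compile(r"^\s*\(([\u0E01-\u0E2E])\)")),     # ก to ฮ
]

def classify_parenthetical(line: str):
    """Return (pattern_name, marker_text) for a leading marker, else None."""
    for name, pattern in PAREN_PATTERNS:
        match = pattern.match(line)
        if match:
            return name, match.group(1)
    return None

for sample in ["(1) First", "(๑) ข้อความ", "(a) item", "(ii) item", "(ก) รายการ"]:
    print(sample, "->", classify_parenthetical(sample))
```

Resolving the (i)/(v)/(x) ambiguity properly would need context from neighbouring items, which this sketch does not attempt.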

Multi-Language & Symbols:

Thai Script: มาตรา, ข้อ, ก.ข.ค.
Bullets: •◦▪→ and 20+ more
Roman: I.II.III, i.ii.iii
Letters: A.B.C, a.b.c
Checkboxes: [x], [ ], [✓]

Intelligent Text Classification:

Header Detection: Title case, all caps, short lines
Paragraph Recognition: Long text, proper punctuation
List Item Identification: Patterned content
Context Analysis: Position, font size, formatting
Confidence Scoring: Reliability assessment
OpenCV Bold Detection: Enabled
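
The classification cues listed above can be illustrated with a small heuristic. This is only a sketch under stated assumptions: `classify_line`, the word-count thresholds, and the confidence values are all hypothetical, and position, font size, and bold information are left out because they require layout data.

```python
import re

# Leading bullet, checkbox, or short numbered/lettered marker such as "1.", "a)", "(1)".
LIST_MARKER = re.compile(r"^\s*(?:[•◦▪→]|\[[ x✓]\]|\(?[0-9A-Za-z]{1,3}[.)])\s+")

def classify_line(line: str, max_header_words: int = 8, min_paragraph_words: int = 12):
    """Return (label, confidence) using only the text of one line."""
    text = line.strip()
    if not text:
        return ("empty", 1.0)
    if LIST_MARKER.match(text):
        return ("list_item", 0.9)
    words = text.split()
    ends_sentence = text[-1] in ".!?"
    # Headers: short, no sentence-final punctuation, all caps or title case.
    if len(words) <= max_header_words and not ends_sentence and (text.isupper() or text.istitle()):
        return ("header", 0.8)
    # Paragraphs: long lines that end like a sentence.
    if len(words) >= min_paragraph_words and ends_sentence:
        return ("paragraph", 0.8)
    # Fallback: treat as paragraph text, but with low confidence.
    return ("paragraph", 0.5)

print(classify_line("Results & Downloads"))
print(classify_line("[x] Upload the PDF"))
```

Returning a confidence alongside the label lets a later stage override weak guesses, for example when a line is detected as bold.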

Technical Enhancements:

OpenCV Text Block Analysis: Enabled
Bold Text Recognition: Enabled
Header Indentation Suppression: Enabled
Smart Table Detection: 70% overlap threshold prevents text loss
HTML Processing: Better structure and formatting preservation
Multi-format Export: TXT, DOCX, and HTML downloads with preserved indentation
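
The page does not say how the 70% overlap threshold is applied. One plausible reading, shown here purely as an assumption, is that a text block is assigned to a table only when at least 70% of its area lies inside the table region, so that blocks merely touching a table stay in the regular text flow. The function names and the (x0, y0, x1, y1) box layout are hypothetical.

```python
def overlap_ratio(block, table):
    """Fraction of the block's area that lies inside the table box.

    Boxes are (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
    """
    inter_w = max(0.0, min(block[2], table[2]) - max(block[0], table[0]))
    inter_h = max(0.0, min(block[3], table[3]) - max(block[1], table[1]))
    area = (block[2] - block[0]) * (block[3] - block[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0

def belongs_to_table(block, table, threshold=0.70):
    """Assumed rule: treat the block as table content only above the threshold."""
    return overlap_ratio(block, table) >= threshold

table = (0, 0, 100, 80)
print(belongs_to_table((10, 10, 50, 30), table))   # fully inside the table
print(belongs_to_table((10, 60, 50, 140), table))  # mostly below the table
```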

Configuration Panel

Upload PDF File

PDF Status

OCR Method

Choose OCR method (all enhanced with OpenCV ✅ + comprehensive indentation detection and text classification)

Auto Selection: Automatically chooses the best available method. Every method applies OpenCV text block analysis and bold detection, indentation detection, text classification, HTML processing, and pattern recognition for hierarchical numbering (including parenthetical patterns such as (1), (๑), and (a)), bullets, and multi-language content, with header indentation suppression.

Header/Footer Removal & Crop Settings with OpenCV Enhancement

Remove headers and footers with high-resolution processing + OpenCV text block analysis

Enable Enhanced Header/Footer Removal with OpenCV Analysis
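
Header/footer removal by cropping can be pictured as discarding text blocks that fall in the top and bottom margins of a page. The sketch below assumes percentage-based crop settings and blocks given as (y0, y1, text) tuples with y increasing downward; `strip_header_footer` and its defaults are hypothetical and do not reflect the service's actual settings.

```python
def strip_header_footer(blocks, page_height, top_pct=10.0, bottom_pct=10.0):
    """Keep the text of blocks whose vertical centre lies inside the uncropped band.

    blocks: iterable of (y0, y1, text); page_height in the same units as y.
    """
    top = page_height * top_pct / 100.0
    bottom = page_height * (1.0 - bottom_pct / 100.0)
    kept = []
    for y0, y1, text in blocks:
        centre = (y0 + y1) / 2.0
        if top <= centre <= bottom:
            kept.append(text)
    return kept

page = [(10, 30, "Company Confidential"), (200, 400, "Body text"), (770, 790, "Page 3")]
print(strip_header_footer(page, page_height=800))
```

Testing the block centre rather than its full extent avoids dropping body text that slightly overlaps a margin.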

Results & Downloads

Processing Status

Extracted Text (OpenCV ✅ Enhanced with Comprehensive Indentation Detection & Text Classification)

Processing Information & Document Analysis

Service Status & OpenCV-Enhanced Capabilities

Available OCR Methods (Enhanced with OpenCV Text Block Analysis & Bold Detection):

✅ Azure Document Intelligence - Ready (HTML + Tables + Comprehensive Indentation + Text Classification + OpenCV Enhanced)
❌ Tesseract OCR - Not available
✅ PyMuPDF - Ready (HTML Enhanced + Comprehensive Indentation + Text Classification + OpenCV Enhanced)

OpenCV Text Block Analysis & Bold Detection Features:

✅ Text Block Detection & Analysis
✅ Bold Text Recognition for Headers
✅ Automatic Spacing & Paragraph Detection
✅ Visual Text Element Analysis
✅ Header Indentation Suppression
✅ Real-time Crop Preview with Text Overlay
✅ Enhanced High-Resolution Processing

Comprehensive Indentation Detection Features:

✅ Hierarchical Decimal Numbering (1.1.1.1.1...)
✅ Mixed Hierarchical Numbering (1.2.a.i.A...)
✅ Legal Numbering (1.1.1(a)(i))
✅ Outline Numbering (I.A.1.a.i.)
✅ Section Numbering (§1.2.3, Article 1.1.1)
✅ Parenthetical Arabic Numerals ((1), (2), (3))
✅ Parenthetical Thai Numerals ((๑), (๒), (๓))
✅ Parenthetical Letters ((a), (b), (A), (B))
✅ Parenthetical Roman Numerals ((i), (ii), (I), (II))
✅ Parenthetical Thai Letters ((ก), (ข), (ค))
✅ Thai Script Support (มาตรา, ข้อ, ก.ข.ค.)
✅ Multiple Bullet Styles (•◦▪→ and more)
✅ Checkbox Items ([x], [ ], [✓])
✅ Roman Numerals (I.II.III, i.ii.iii)
✅ Letter Lists (A.B.C, a.b.c)
✅ Space-based Indentation Detection
✅ Priority-based Pattern Matching

Intelligent Text Classification Features:

✅ Header Detection (title case, all caps, short lines)
✅ Paragraph Classification (long text, proper punctuation)
✅ List Item Recognition (patterned content)
✅ Context-aware Analysis (position, font size)
✅ Confidence Scoring
✅ Document Structure Analysis
✅ OpenCV-Enhanced Bold Text Detection
✅ Header Indentation Suppression

Enhanced Processing Features:

✅ HTML Processing - Available
✅ Enhanced Table Handling - Available
✅ Smart Text Preservation - Available
✅ Multi-Page Crop Preview - Available
✅ Per-Page Crop Customization - Available
✅ Document Structure Analysis - Available
✅ OpenCV Text Block Overlay - Available
✅ Bold Text Visualization - Available
✅ Enhanced DOCX Export - Available (with OpenCV-enhanced indentation formatting)
✅ HTML File Export - Available
✅ Enhanced Text Export - Available
✅ Pattern Detection Engine - 26 patterns supported