PDF OCR Service - Enhanced with HTML Processing
Convert PDF documents to text using enhanced OCR with an intermediate HTML processing step, smart table handling, and format preservation.
Instructions & Features
How to Use:
- Upload PDF: Select your PDF file in the configuration panel below
- Choose Method: Select OCR method (Auto recommended for best results)
- Configure Crop (Optional): Enable header/footer removal and adjust crop settings
- Process: Click the process button to extract text with HTML enhancement
- Download: Get results in TXT, DOCX, or HTML format
Enhanced Features:
- Smart Table Detection: A 70% overlap threshold guards against losing text that sits near or inside tables
- HTML Processing: Better structure and formatting preservation
- Multi-format Export: TXT, DOCX, and HTML downloads
- Advanced Crop Control: Per-page customization with real-time preview
- Enhanced Resolution: High-quality processing for better accuracy
- Page Numbers: Automatic page numbering in extracted content
- Proper Indentation: Preserved spacing and formatting
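The 70% overlap threshold in the feature list can be illustrated with a small sketch. The service does not document how overlap is measured, so this example assumes it is the fraction of a text block's bounding box that falls inside a detected table region. The names `Box`, `overlap_ratio`, and `belongs_to_table` are hypothetical and not part of the service.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


# The 70% figure comes from the feature list; everything else is an assumption.
TABLE_OVERLAP_THRESHOLD = 0.70


def area(box: Box) -> float:
    """Area of a box, or 0 if the box is empty or inverted."""
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def overlap_ratio(text: Box, table: Box) -> float:
    """Fraction of the text box's area that lies inside the table box."""
    intersection = Box(
        max(text.x0, table.x0),
        max(text.y0, table.y0),
        min(text.x1, table.x1),
        min(text.y1, table.y1),
    )
    text_area = area(text)
    return area(intersection) / text_area if text_area > 0 else 0.0


def belongs_to_table(text: Box, table: Box,
                     threshold: float = TABLE_OVERLAP_THRESHOLD) -> bool:
    """Treat a text block as table content only when enough of it overlaps."""
    return overlap_ratio(text, table) >= threshold
```

Under this assumption, a text block that only grazes a table region stays in the body text instead of being absorbed into the table, which is one plausible way a threshold could prevent text loss.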
Configuration Panel
Choose OCR method (all enhanced with HTML processing)
Auto Selection: Automatically chooses the best available method with HTML processing and enhanced table handling.
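As a rough illustration of Auto Selection, the sketch below picks the first available method from a priority list. The priority order and the function name `choose_method` are assumptions made for this example; the service does not state how it ranks methods.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical priority order; the service's actual ranking is not documented.
PRIORITY = ("azure", "tesseract", "pymupdf")


def choose_method(requested: str,
                  available: Mapping[str, bool],
                  priority: Sequence[str] = PRIORITY) -> Optional[str]:
    """Return the OCR method to use, or None if nothing suitable is available."""
    if requested != "auto":
        # An explicit choice is honored only when that method is available.
        return requested if available.get(requested, False) else None
    for name in priority:
        if available.get(name, False):
            return name
    return None


# Availability as reported under Service Status on this page.
status = {"azure": True, "tesseract": False, "pymupdf": True}
selected = choose_method("auto", status)
```

With the availability shown under Service Status, `selected` is `"azure"`. If Azure were unavailable, the same call would skip Tesseract and fall through to `"pymupdf"`.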
Header/Footer Removal & Crop Settings
Remove headers and footers with high-resolution processing
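The crop settings can be thought of as percentages of the page height trimmed from the top and bottom. The sketch below computes the kept region for each page, with optional per-page overrides. It assumes a top-left origin with y growing downward; the names `crop_box` and `per_page_boxes` and the percentage-based settings are illustrative assumptions, not the service's actual interface.

```python
from typing import Dict, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def crop_box(page_width: float, page_height: float,
             top_pct: float = 0.0, bottom_pct: float = 0.0) -> Rect:
    """Region kept after trimming a header band and a footer band."""
    if top_pct < 0 or bottom_pct < 0 or top_pct + bottom_pct >= 100:
        raise ValueError("crop percentages must be >= 0 and sum to less than 100")
    y0 = page_height * top_pct / 100
    y1 = page_height - page_height * bottom_pct / 100
    return (0.0, y0, float(page_width), y1)


def per_page_boxes(page_sizes: Sequence[Tuple[float, float]],
                   default: Tuple[float, float],
                   overrides: Dict[int, Tuple[float, float]]) -> List[Rect]:
    """Apply default (top_pct, bottom_pct) to every page unless overridden."""
    boxes = []
    for index, (width, height) in enumerate(page_sizes):
        top_pct, bottom_pct = overrides.get(index, default)
        boxes.append(crop_box(width, height, top_pct, bottom_pct))
    return boxes


# Two 600x800 pages: page 0 uses the default crop, page 1 is left uncropped.
boxes = per_page_boxes([(600, 800), (600, 800)],
                       default=(10, 5),
                       overrides={1: (0, 0)})
```

The resulting rectangles could then be passed to whatever PDF library performs the actual cropping before OCR runs.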
Results & Downloads
Service Status
Available OCR Methods (Enhanced with HTML Processing):
- ✓ Azure Document Intelligence - Ready (HTML + Tables)
- ✗ Tesseract OCR - Not available
- ✓ PyMuPDF - Ready (HTML Enhanced)
- ✓ HTML Processing - Available
- ✓ Enhanced Table Handling - Available
- ✓ Smart Text Preservation - Available
- ✓ Multi-Page Crop Preview - Available
- ✓ Per-Page Crop Customization - Available
- ✓ Enhanced DOCX Export - Available
- ✓ HTML File Export - Available
- ✓ Enhanced Text Export - Available