Supported Image Formats for OCR
OCR (Optical Character Recognition) technology is designed to read and extract text from digital image files. However, not all image formats are equally suitable for OCR processing. Choosing the right format can dramatically affect recognition accuracy, processing speed, and output quality. On this page, we explore the most commonly supported image formats, their pros and cons, and best practices when preparing files for OCR.
1. Popular OCR-Compatible Formats
Our OCR system supports a wide range of image formats, including the following:
- JPG / JPEG: One of the most widely used formats, offering decent quality with small file size. Ideal for scanned documents and photos.
- PNG: A lossless format that preserves image detail. Great for screenshots or images with sharp contrast text.
- BMP: An uncompressed format that stores all image data. Provides high fidelity but results in large file sizes.
- GIF: Limited to 256 colors and generally not recommended for detailed text, but supported for basic OCR tasks.
- WEBP: A modern format with both lossless and lossy compression. Supported in most browsers and compatible with OCR engines.
- PDF (Image-based): Though not an image file per se, PDFs containing scanned pages are processed like images.
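File extensions are not always trustworthy, so it can help to check the leading bytes of an upload before routing it to OCR. The sketch below is illustrative only: the function name `sniff_image_format` is hypothetical, it is not how this OCR system detects formats, and it covers only the signatures of the formats listed above plus TIFF.

```python
def sniff_image_format(header: bytes) -> str:
    """Guess a container format from the first bytes of a file.

    Only well-known signatures are checked; anything else is "unknown".
    """
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container whose form type at offset 8 is "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    if header.startswith(b"%PDF-"):
        return "pdf"
    # TIFF starts with a byte-order mark followed by the number 42.
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    # "BM" is a weak two-byte signature, so it is checked last.
    if header.startswith(b"BM"):
        return "bmp"
    return "unknown"


if __name__ == "__main__":
    print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

In practice you would read the first 16 bytes or so of the file and pass them in; a full decoder is still needed to confirm that the file is valid.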
2. Recommended Formats for High Accuracy
For optimal OCR results, we recommend using:
- PNG for clarity and detail, especially for screenshots or simple documents.
- JPG at 300 DPI or higher for scanned documents.
- BMP if file size is not an issue and maximum quality is desired.
3. File Size Considerations
File size impacts performance. Larger files often carry more detail, but they also take longer to upload and process. It's best to:
- Keep files under 10MB when possible
- Use compression wisely (avoid over-compression)
- Check that lossy compression has not blurred the text
4. Multi-Page and Complex Files
OCR engines can process multi-page formats like PDFs and TIFFs, but results may vary based on layout complexity and resolution. Always preview the output to ensure all pages are correctly read and interpreted.
5. Unsupported or Problematic Formats
Some formats may not yield good OCR results due to their nature:
- HEIC: Common in iPhones, but not universally supported
- TIFF (multi-layered): Requires specific decoders
- Low-resolution thumbnails: Often too blurry for accurate recognition
6. Preparing Files for OCR
Regardless of the format, follow these steps to optimize the file before uploading:
- Crop unnecessary borders or background
- Enhance brightness and contrast
- Rotate or deskew to align text horizontally
- Use grayscale or black-and-white for best results
Conclusion
Choosing the right file format is a crucial part of successful OCR. While our system accepts most image types, using high-quality PNG or JPG images with clean text and proper resolution will consistently produce the best results. Always prepare your files with OCR optimization in mind to get fast, accurate, and reliable text extraction.
Advanced Guide: Formats, Containers, Color Models, and Rendering
Picking the right format is about more than “what opens in a browser.” It determines how much real text detail reaches the recognizer, how stable colors and edges remain after compression, and whether layout nuances (vectors, transparency, overprint) survive rasterization. The sections below give practical guidance for production workflows.
1) Raster vs. Vector: When to Rasterize
OCR works on pixels. If your source is vector (PDF with embedded text/paths), you face a choice: try to extract the text layer directly or rasterize to an image first. Prefer direct text extraction when fonts are embedded and the text layer is intact; switch to rasterization when the PDF is just scanned pages, when glyphs are drawn as bare outlines (for example, text converted to curves, or Type 3 fonts without a usable Unicode mapping), or when the encoding/CMap is broken. For rasterization, target a line x-height of roughly 20–30 pixels (often ≈300–400 DPI for common A4/Letter scans). Rendering below this range erases fine serifs; rendering well above it wastes memory without adding legibility.
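The relationship between font size, DPI, and x-height in pixels is simple arithmetic: one point is 1/72 inch, so a feature that is `p` points tall covers `p / 72 * DPI` pixels. The sketch below solves that for DPI. The function name and the default x-height ratio of 0.5 (x-height as a fraction of the nominal font size) are assumptions for illustration; real typefaces vary.

```python
def dpi_for_xheight(font_size_pt: float,
                    target_xheight_px: float,
                    xheight_ratio: float = 0.5) -> float:
    """Return the render DPI that gives the requested x-height in pixels.

    xheight_ratio is an assumed typeface property (x-height / font size).
    """
    if font_size_pt <= 0 or xheight_ratio <= 0:
        raise ValueError("font size and x-height ratio must be positive")
    xheight_pt = font_size_pt * xheight_ratio
    # pixels = points / 72 * dpi, so dpi = pixels * 72 / points
    return target_xheight_px * 72.0 / xheight_pt


if __name__ == "__main__":
    for target in (20, 30):
        print(target, "px ->", dpi_for_xheight(10, target), "DPI")
```

For 10 pt body text under this assumption, the 20–30 px target works out to roughly 290–430 DPI, which is in line with the range quoted above.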
2) Compression Families and Their Artifacts
- Lossless (PNG, TIFF/Deflate): Preserves edges and single-pixel stems—ideal for UI text, screenshots, and simple black-on-white pages.
- Lossy (JPEG/WebP lossy): Introduces ringing and blocking near sharp edges. If you must use it, keep higher quality (low compression), avoid multiple save cycles, and prefer WebP over very compressed JPEG for small text.
- Bi-level (TIFF Group 4, JBIG2): Tiny files for scans with pure black text. Beware aggressive symbol substitution (historically with JBIG2) that can alter digits—prefer safer settings or lossless alternatives for legal/financial records.
- Animated/Indexed (GIF): 256-color palette and dithering patterns can hurt thin strokes; convert to PNG before OCR.
3) Color Models and Subsampling
Most camera JPEGs store luma/chroma as YCbCr with subsampling (4:2:0 or 4:2:2). Subsampling blurs color edges—fine for photos, risky for colored text on colored backgrounds. Prefer no subsampling (4:4:4) for small colored fonts, or use PNG. For print assets, CMYK images may appear dull if converted naively to sRGB; apply ICC-aware conversion. Grayscale is often best for black-text documents: it reduces noise and file size without color artifacts.
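To see why 4:2:0 subsampling is risky for colored text, the following toy sketch averages each 2×2 block of a chroma plane and then expands it back by pixel replication. This is a simplification: real encoders and decoders use their own filters, and the function name is hypothetical. The point is only that a sharp chroma edge that does not fall on a block boundary gets smeared.

```python
def subsample_420_roundtrip(plane):
    """Average 2x2 chroma blocks, then replicate back to full size.

    `plane` is a list of rows of 8-bit chroma values with even dimensions.
    """
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("this sketch expects even dimensions")
    out = [[0] * width for _ in range(height)]
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            avg = (total + 2) // 4  # rounded integer mean of the block
            for dy in (0, 1):
                for dx in (0, 1):
                    out[y + dy][x + dx] = avg
    return out


if __name__ == "__main__":
    # A vertical color edge between column 0 and column 1.
    cb = [[0, 200, 200, 200] for _ in range(4)]
    for row in subsample_420_roundtrip(cb):
        print(row)
```

The edge that was one pixel wide now spans a two-pixel block at an intermediate value, which is the kind of color bleed that hurts small colored glyphs.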
4) ICC Profiles, Gamma, and Black Point
Always respect embedded ICC profiles. Convert everything to a working space (sRGB for the web) with black-point compensation to avoid crushed shadows. Use gamma-correct resampling (convert to linear light before scaling, then re-encode for display) so that edge weight stays consistent. When profiles are missing, assume sRGB for browser workflows, but be cautious with legacy scans tagged as “DeviceGray/DeviceCMYK.”
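Gamma-correct resampling is usually taken to mean averaging pixel values in linear light rather than on the encoded sRGB values. The sketch below uses the standard sRGB transfer function; the helper names are hypothetical, and a real resampler would apply the same idea per channel inside its filter kernel.

```python
def srgb_to_linear(value_8bit: int) -> float:
    """Decode an 8-bit sRGB value to linear light in [0, 1]."""
    c = value_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(linear: float) -> int:
    """Encode linear light in [0, 1] back to an 8-bit sRGB value."""
    if linear <= 0.0031308:
        c = 12.92 * linear
    else:
        c = 1.055 * linear ** (1 / 2.4) - 0.055
    return round(max(0.0, min(1.0, c)) * 255)


def average_naive(a: int, b: int) -> int:
    """Average two encoded values directly (not gamma-correct)."""
    return round((a + b) / 2)


def average_linear(a: int, b: int) -> int:
    """Average two values in linear light, then re-encode."""
    return linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2)


if __name__ == "__main__":
    print("naive: ", average_naive(0, 255))
    print("linear:", average_linear(0, 255))
```

Averaging a black and a white pixel directly gives a mid-gray code value, while the linear-light average comes out noticeably lighter. When downscaling thin dark strokes, that difference changes how heavy the strokes look to the recognizer.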
5) Alpha and Premultiplication
Transparent assets (PNGs with logos, stamps) may be premultiplied. During compositing to a white page, use the correct alpha mode to prevent gray fringes around glyphs. If you see halos, you likely applied the wrong alpha mode (treating premultiplied data as straight, or the reverse) or mismatched gamma. Flatten to an opaque background before OCR if transparency is only decorative.
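The difference between the two alpha modes comes down to one multiplication. With straight alpha the result is `color * a + background * (1 - a)`; with premultiplied alpha the color has already been scaled, so the result is `color + background * (1 - a)`. The function below is a minimal sketch with a hypothetical name, operating on one 8-bit RGBA pixel and ignoring gamma for simplicity.

```python
def flatten_on_white(pixel, premultiplied: bool):
    """Composite one 8-bit RGBA pixel onto an opaque white background."""
    r, g, b, a = pixel
    alpha = a / 255.0
    background = 255.0 * (1.0 - alpha)  # white's contribution
    if premultiplied:
        # Color channels were already multiplied by alpha when stored.
        channels = [c + background for c in (r, g, b)]
    else:
        channels = [c * alpha + background for c in (r, g, b)]
    return tuple(min(255, max(0, round(v))) for v in channels)


if __name__ == "__main__":
    # A half-transparent white edge pixel, stored premultiplied.
    edge = (128, 128, 128, 128)
    print("correct mode:", flatten_on_white(edge, premultiplied=True))
    print("wrong mode:  ", flatten_on_white(edge, premultiplied=False))
```

Treating the premultiplied pixel as straight alpha multiplies by alpha a second time, so an edge that should blend into the white page turns gray. That is the fringe described above.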
6) PDF Internals that Affect OCR
- Boxes and Rotation: Honor /CropBox, /MediaBox, and the page /Rotate entry. Ignoring them misaligns columns.
- Image XObjects: A page can hold multiple images at different scales. Choose the highest-resolution source rather than a downsampled thumbnail.
- Blend/Overprint: Transparency groups and overprint may hide or lighten text. Render with overprint simulation when present.
- Fonts: Embedded CID fonts are ideal. If fonts are missing and you only see outlines, treat as raster input.
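As a rough illustration of honoring page boxes and rotation, the sketch below computes the pixel size of a rasterized page. It assumes boxes are given as `(x0, y0, x1, y1)` in PDF points, that a missing crop box falls back to the media box, and that the crop box is clipped to the media box. The function name is hypothetical; a PDF library would normally expose these values for you.

```python
def raster_size(media_box, crop_box=None, rotate=0, dpi=300):
    """Return (width_px, height_px) for a page rendered at `dpi`.

    Boxes are (x0, y0, x1, y1) in points (1/72 inch).
    """
    if rotate % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90")
    box = crop_box if crop_box is not None else media_box
    # Clip the visible box to the media box.
    x0 = max(box[0], media_box[0])
    y0 = max(box[1], media_box[1])
    x1 = min(box[2], media_box[2])
    y1 = min(box[3], media_box[3])
    width_pt, height_pt = x1 - x0, y1 - y0
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("crop box does not intersect media box")
    # A quarter turn swaps width and height.
    if rotate % 360 in (90, 270):
        width_pt, height_pt = height_pt, width_pt
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


if __name__ == "__main__":
    letter = (0, 0, 612, 792)
    print(raster_size(letter))             # portrait
    print(raster_size(letter, rotate=90))  # rotated page
```

Skipping the rotation or crop step produces a bitmap with the wrong aspect ratio or extra margins, which is one way columns end up misaligned.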
7) DPI, Scaling, and Resampling Filters
For scans, 300 DPI is a good default; 200 DPI is often acceptable for large fonts; 400–600 DPI helps tiny footnotes or dot-matrix prints. When resizing:
- Downscale with a high-quality filter that preserves edges (e.g., Lanczos or bicubic with sharpening).
- Upscaling rarely recovers detail; use it only to standardize x-height for the recognizer.
- Perform operations in 16-bit where possible to avoid banding in aggressive contrast stretches.
8) Binarization and Quantization
Adaptive thresholding (e.g., Sauvola/Wolf) separates text from uneven backgrounds. Apply it gently: over-binarization breaks thin strokes and diacritics. For bilevel formats, verify the result at 100% zoom; for small text, prefer an anti-aliased grayscale image over a hard bilevel one to preserve stroke shape.
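For reference, Sauvola's method computes a per-pixel threshold `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the local mean and standard deviation, `R` is the dynamic range of the standard deviation (128 for 8-bit images), and `k` is a sensitivity parameter. The pure-Python sketch below is deliberately naive (it recomputes each window from scratch) and is meant for reading, not for production speed; the function name and defaults are illustrative.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, dynamic_range=128.0):
    """Return a 2-D list of booleans; True marks foreground (dark text).

    `gray` is a list of rows of 8-bit intensities. Windows are clipped
    at the image border.
    """
    height, width = len(gray), len(gray[0])
    radius = window // 2
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [gray[yy][xx]
                      for yy in range(max(0, y - radius), min(height, y + radius + 1))
                      for xx in range(max(0, x - radius), min(width, x + radius + 1))]
            mean = sum(values) / len(values)
            variance = sum(v * v for v in values) / len(values) - mean * mean
            std = math.sqrt(max(variance, 0.0))
            threshold = mean * (1.0 + k * (std / dynamic_range - 1.0))
            out[y][x] = gray[y][x] < threshold
    return out


if __name__ == "__main__":
    page = [[200] * 5 for _ in range(5)]
    page[2][2] = 50  # one dark "ink" pixel
    for row in sauvola_binarize(page, window=3):
        print("".join("#" if fg else "." for fg in row))
```

Raising `k` lowers the threshold, so fewer pixels count as ink and thin strokes are more likely to break. Production code would typically use integral images or an existing library implementation.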
9) Multi-Page Containers: TIFF and PDF
Multi-page TIFF is efficient for archival workflows; PDF is friendlier for viewing/sharing. For OCR, both work—ensure consistent per-page DPI and metadata. If your PDF mixes resolutions, rasterize per page at a target x-height rather than one global DPI.
10) Metadata and Provenance
Keep EXIF/XMP or PDF metadata when possible: DPI, orientation, capture device, and software chain. Provenance accelerates triage when results regress. If privacy rules restrict metadata, retain a minimal manifest (hash, approximate DPI, color model) alongside exports.
11) Fonts, Hinting, and Small Text
Pixel-aligned system fonts in screenshots may show color fringing from subpixel rendering. Convert to grayscale (or desaturate) before OCR to remove subpixel artifacts. For tiny UI text, aim for an x-height near 20–24 px after scaling; below ~14 px, recognition quality drops steeply.
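One simple way to desaturate is a weighted luma sum. The sketch below uses the BT.601 weights (0.299, 0.587, 0.114) in integer arithmetic; that choice of weights and the function name are assumptions for illustration, and other weightings (such as BT.709) are equally valid as long as you apply one consistently.

```python
def to_gray(rgb):
    """Convert an 8-bit (r, g, b) tuple to an 8-bit luma value (BT.601)."""
    r, g, b = rgb
    # Integer weights sum to 1000; +500 rounds to nearest.
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def desaturate(image):
    """Apply to_gray to every pixel of a 2-D list of (r, g, b) tuples."""
    return [[to_gray(px) for px in row] for row in image]


if __name__ == "__main__":
    # A glyph edge with reddish/bluish fringes from subpixel rendering.
    edge = [[(255, 255, 255), (255, 180, 120), (0, 0, 0),
             (120, 180, 255), (255, 255, 255)]]
    print(desaturate(edge))
```

After conversion the colored fringe pixels become plain intermediate grays, so the recognizer sees an ordinary anti-aliased edge.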
12) Tables, Lines, and Halftones
Halftone photographs inside documents confuse binarizers. Use selective processing: mask or soften halftones, but keep edges around text and table rules crisp. For tables, run a mild morphological open/close pass to solidify broken grid lines before segmentation.
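A morphological close is a dilation followed by an erosion; it bridges gaps narrower than the structuring element while leaving larger gaps alone. The sketch below applies a horizontal close to a single binary row, which is the simplest case for a broken horizontal table rule. The function names are hypothetical, and border pixels simply use whatever neighbors are in bounds.

```python
def dilate_row(row, radius=1):
    """Each pixel becomes the max of its horizontal neighborhood."""
    n = len(row)
    return [max(row[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def erode_row(row, radius=1):
    """Each pixel becomes the min of its horizontal neighborhood."""
    n = len(row)
    return [min(row[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def close_row(row, radius=1):
    """Morphological close: dilation followed by erosion."""
    return erode_row(dilate_row(row, radius), radius)


if __name__ == "__main__":
    broken_rule = [0, 0, 1, 1, 0, 1, 1, 0, 0]  # 1 = ink, one-pixel gap
    print(close_row(broken_rule))
```

Keep the radius small: a large structuring element will also merge neighboring characters, which is why the pass should stay mild.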
13) Web-Friendly Rendering Paths
In the browser, prefer <canvas> with OffscreenCanvas/Web Workers for heavy steps. Tile large pages (e.g., 1024×1024) to keep memory bounded; render vector PDFs to tiles and stitch. Avoid repeated encode/decode cycles; hold intermediate bitmaps in memory and compress once on export.
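The tiling geometry is independent of the rendering API. The sketch below is written in Python only to show the arithmetic (in a browser the same loop would drive canvas draws); the function name is hypothetical. It yields `(x, y, width, height)` rectangles that cover the page exactly, with smaller tiles on the right and bottom edges.

```python
def tile_rects(page_width, page_height, tile=1024):
    """Yield (x, y, w, h) rectangles covering the page in row-major order."""
    if tile <= 0:
        raise ValueError("tile size must be positive")
    for y in range(0, page_height, tile):
        for x in range(0, page_width, tile):
            yield (x, y, min(tile, page_width - x), min(tile, page_height - y))


if __name__ == "__main__":
    for rect in tile_rects(2500, 1100):
        print(rect)
```

Because each tile is rendered and processed separately, peak memory is bounded by the tile size rather than by the full page bitmap.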
14) Locale and Color Conventions
Locale affects more than text: currency marks, decimal separators, date glyphs, and even color conventions in highlighted text. Keep color-to-mono conversions consistent so emphasized tokens (e.g., red warnings) do not vanish into the background.
15) Safety and Integrity Notes
- Avoid extreme JPEG recompression of evidence documents; keep a “source” copy.
- When converting HEIC/RAW, resolve orientation flags and embed ICC on export.
- Do not strip transparency if it carries meaning (e.g., stamps/watermarks used for validation).
16) Export Targets for Downstream Use
Choose export by workflow: plain text for quick copy; JSON with region coordinates for automation; CSV for tables; and ALTO XML or hOCR when you need reading order and layout semantics. Always include page DPI, color model, and a per-page checksum in the export to make debugging reproducible.
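A JSON export that carries region coordinates plus the debugging fields mentioned above might look like the following sketch. The field names (`page`, `dpi`, `color_model`, `sha256`, `regions`, `bbox`) and the function name are illustrative assumptions, not a schema this system defines.

```python
import hashlib
import json


def build_page_export(page_index, image_bytes, dpi, color_model, regions):
    """Build a JSON-serializable record for one OCR'd page.

    `regions` is an iterable of (text, (x, y, width, height)) pairs.
    """
    return {
        "page": page_index,
        "dpi": dpi,
        "color_model": color_model,
        # Checksum of the exact bitmap that was recognized.
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "regions": [
            {"text": text, "bbox": list(bbox)} for text, bbox in regions
        ],
    }


if __name__ == "__main__":
    record = build_page_export(
        page_index=1,
        image_bytes=b"placeholder page bitmap",
        dpi=300,
        color_model="gray",
        regions=[("Invoice", (120, 80, 310, 42))],
    )
    print(json.dumps(record, indent=2))
```

Storing the checksum next to the text makes it possible to confirm later that a regression was caused by the recognizer and not by a changed input image.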
17) Practical Checklists
- For screenshots/UI text → PNG (lossless), sRGB, no subsampling.
- For scanned print → JPEG/WebP at high quality or TIFF/PNG; target 300 DPI; ensure proper rotation.
- For small colored fonts → 4:4:4 sampling or PNG; avoid heavy JPEG subsampling.
- For archival multipage → PDF/A or TIFF multipage with consistent DPI and embedded profiles.
- Before OCR → deskew, gentle denoise, adaptive threshold only if needed, preserve stems and diacritics.
Summary
Format, container, and color decisions directly shape OCR quality. Prefer lossless or lightly compressed inputs for text; honor ICC profiles; rasterize vectors at a target x-height; and export with metadata that preserves provenance. These habits keep edges clean, colors predictable, and pages faithful—so recognition is both accurate and repeatable.