Supported Image Formats for OCR

OCR (Optical Character Recognition) technology is designed to read and extract text from digital image files. However, not all image formats are equally suitable for OCR processing. Choosing the right format can dramatically affect recognition accuracy, processing speed, and output quality. On this page, we explore the most commonly supported image formats, their pros and cons, and best practices when preparing files for OCR.

1. Popular OCR-Compatible Formats

Our OCR system supports a wide range of image formats, including the following:

- PNG: lossless compression that keeps text edges crisp
- JPG/JPEG: widely supported; works well at high quality settings
- TIFF: single- or multi-page, common in scanning and archival workflows
- PDF: scanned or image-based documents, processed page by page

2. Recommended Formats for High Accuracy

For optimal OCR results, we recommend using:

- PNG for screenshots, rendered pages, and any image with sharp, computer-generated text
- High-quality JPG (minimal compression) for photographed documents
- TIFF or PDF for multi-page scans, kept at a consistent resolution around 300 DPI

3. File Size Considerations

File size impacts performance. While large files provide more detail, they also increase loading and processing time. It's best to:

- Scan at around 300 DPI rather than the maximum your scanner offers
- Crop away margins and borders that contain no text
- Avoid repeated save/re-compress cycles, which add artifacts without adding detail

4. Multi-Page and Complex Files

OCR engines can process multi-page formats like PDFs and TIFFs, but results may vary based on layout complexity and resolution. Always preview the output to ensure all pages are correctly read and interpreted.

5. Unsupported or Problematic Formats

Some formats may not yield good OCR results due to their nature:

- Vector formats such as SVG, which must be rasterized before recognition
- Heavily compressed JPGs, whose block artifacts blur or break character edges
- Low-color or animated images such as GIF, whose limited palettes discard text detail

Before uploading, convert such formats to PNG or JPG for better compatibility.

6. Preparing Files for OCR

Regardless of the format, follow these steps to optimize the file before uploading:

- Scan or export at 300 DPI or higher
- Deskew and crop so the text is horizontal and fills the frame
- Convert to grayscale when the document is black text on a light background
- Ensure even lighting and strong contrast between text and background
- Flatten transparency and remove decorative overlays

Conclusion

Choosing the right file format is a crucial part of successful OCR. While our system accepts most image types, using high-quality PNG or JPG images with clean text and proper resolution will consistently produce the best results. Always prepare your files with OCR optimization in mind to get fast, accurate, and reliable text extraction.

Advanced Guide: Formats, Containers, Color Models, and Rendering

Picking the right format is about more than “what opens in a browser.” It determines how much real text detail reaches the recognizer, how stable colors and edges remain after compression, and whether layout nuances (vectors, transparency, overprint) survive rasterization. The sections below give practical, production-ready guidance you can apply without changing your site’s design or scripts.

1) Raster vs. Vector: When to Rasterize

OCR works on pixels. If your source is vector (PDF with embedded text/paths), you face a choice: try to extract the text layer directly or rasterize to an image first. Prefer direct text extraction when fonts are embedded and the text layer is intact; switch to rasterization when the PDF is just scanned pages, when glyphs are curves (Type 3 outlines), or when encoding/CMap is broken. For rasterization, target a line x-height of roughly 20–30 pixels (often ≈300–400 DPI for common A4/Letter scans). Downscaling below this range erases fine serifs; oversizing wastes memory without adding legibility.
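That x-height target translates into a render DPI with a little arithmetic; a minimal sketch (the function name and the 25 px default are illustrative, not part of any API):

```python
def rasterization_dpi(xheight_pt: float, target_xheight_px: float = 25.0) -> float:
    """Return the render DPI at which a glyph x-height of `xheight_pt`
    points (1 pt = 1/72 inch) lands near `target_xheight_px` pixels."""
    return target_xheight_px * 72.0 / xheight_pt

# A typical 10 pt body font has an x-height of roughly 5 pt:
# rasterization_dpi(5.0) -> 360.0, i.e. render at about 360 DPI
```

For the midpoint of the 20–30 px range, common body text thus lands in the 300–400 DPI band the paragraph above describes.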

2) Compression Families and Their Artifacts

Lossless codecs (PNG, TIFF with LZW or ZIP) preserve edges exactly and are the safest choice for text. Lossy JPEG introduces 8×8 block artifacts and ringing around high-contrast strokes; at aggressive quality settings those artifacts merge or sever thin characters. Bilevel codecs such as CCITT Group 4 are compact for clean scans, while pattern-matching codecs like JBIG2 can silently substitute visually similar glyphs and should be used with care for archival text.

3) Color Models and Subsampling

Most camera JPEGs store luma/chroma as YCbCr with subsampling (4:2:0 or 4:2:2). Subsampling blurs color edges—fine for photos, risky for colored text on colored backgrounds. Prefer no subsampling (4:4:4) for small colored fonts, or use PNG. For print assets, CMYK images may appear dull if converted naively to sRGB; apply ICC-aware conversion. Grayscale is often best for black-text documents: it reduces noise and file size without color artifacts.
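The grayscale conversion itself can use the standard BT.601 luma weights, the same weighting a JPEG encoder applies to compute its Y channel; a per-pixel sketch:

```python
def luma(rgb):
    """BT.601 luma of an 8-bit RGB pixel: the Y channel a JPEG
    encoder would compute from the same values."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# Pure red keeps only ~30% of its intensity in grayscale,
# which is why red text on a mid-gray background can vanish.
```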

4) ICC Profiles, Gamma, and Black Point

Always respect embedded ICC profiles. Convert everything to a working space (sRGB for the web) with black-point compensation to avoid crushed shadows. Gamma-correct resampling (decode to linear light, scale, then re-encode) keeps edges consistent. When profiles are missing, assume sRGB for browser workflows, but be cautious with legacy scans tagged as “DeviceGray/DeviceCMYK.”
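What “gamma-correct” means in code: decode with the sRGB transfer function, operate in linear light, re-encode. A sketch with the standard sRGB curves (function names are illustrative):

```python
def srgb_to_linear(c: float) -> float:
    """Decode an sRGB-encoded channel (0-1) to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c: float) -> float:
    """Encode a linear-light channel (0-1) back to sRGB."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

def average_gamma_correct(a: float, b: float) -> float:
    """Average two sRGB values correctly: decode, mix, re-encode."""
    return linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2.0)

# Naively averaging black (0.0) and white (1.0) gives 0.5, which looks
# too dark; mixing in linear light gives roughly 0.735.
```

This is exactly the error that darkens thin strokes when an image is downscaled in gamma space.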

5) Alpha and Premultiplication

Transparent assets (PNGs with logos, stamps) may be premultiplied. During compositing to a white page, use the correct alpha mode to prevent gray fringes around glyphs. If you see halos, you likely blended un-premultiplied data or mismatched gamma. Flatten to an opaque background before OCR if transparency is only decorative.
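Flattening straight (non-premultiplied) alpha onto an opaque white page is the standard “over” operation; an 8-bit per-pixel sketch:

```python
def over_white(rgb, alpha):
    """Composite a straight-alpha pixel over opaque white (values 0-255).
    Feeding premultiplied data through this formula produces exactly the
    gray fringes described above, so know which form your decoder emits."""
    a = alpha / 255.0
    return tuple(round(c * a + 255 * (1 - a)) for c in rgb)
```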

6) PDF Internals that Affect OCR

How well a PDF OCRs depends on its internals. An intact text layer with embedded fonts and valid ToUnicode CMaps can be extracted directly, with no recognition step at all. Type 3 fonts, subset fonts without usable encodings, or broken CMaps force rasterization. Image XObjects carry their own resolution and color space independent of the page, so the nominal page DPI can be misleading. Inspect these properties before choosing between extraction and rasterization.

7) DPI, Scaling, and Resampling Filters

For scans, 300 DPI is a good default; 200 DPI is often acceptable for large fonts; 400–600 DPI helps tiny footnotes or dot-matrix prints. When resizing:

- Downscale in linear light (gamma-correct) so edges do not darken or shift
- Use a high-quality filter (Lanczos or area averaging); avoid nearest-neighbor for text
- Resize once to the target size rather than in repeated small steps, which compounds blur
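The size change for a DPI retarget is plain proportional scaling; a small sketch (the function name is illustrative):

```python
def scaled_size(width_px: int, height_px: int, src_dpi: float, dst_dpi: float):
    """Return pixel dimensions after rescaling an image from src_dpi
    to dst_dpi, rounding to whole pixels."""
    scale = dst_dpi / src_dpi
    return round(width_px * scale), round(height_px * scale)

# A Letter page scanned at 600 DPI (5100 x 6600 px) downscaled to 300 DPI:
# scaled_size(5100, 6600, 600, 300) -> (2550, 3300)
```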

8) Binarization and Quantization

Adaptive thresholding (e.g., Sauvola/Wolf) separates text from uneven backgrounds. Apply it gently: over-aggressive binarization breaks thin strokes and diacritics. Before committing to a bilevel format, verify the result at 100% zoom; for small text, prefer keeping a grayscale (anti-aliased) version so stroke intent is preserved.
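The Sauvola rule itself is compact: threshold T = m · (1 + k · (s/R − 1)) from the local mean m and local standard deviation s. A deliberately simple (and slow) plain-Python sketch; production code would use integral images instead of per-pixel window scans:

```python
import math

def sauvola_binarize(img, window=15, k=0.2, R=128.0):
    """Sauvola adaptive thresholding on a 2D list of 0-255 grayscale
    values. Returns 0 (text) / 255 (background). The window is clipped
    at image borders."""
    h, w = len(img), len(img[0])
    half = window // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            t = m * (1 + k * (s / R - 1))
            out[y][x] = 0 if img[y][x] <= t else 255
    return out
```

Note how a uniform bright region stays background even though its threshold m·(1 − k) sits well below the pixel values; that is the property that tolerates uneven illumination.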

9) Multi-Page Containers: TIFF and PDF

Multi-page TIFF is efficient for archival workflows; PDF is friendlier for viewing/sharing. For OCR, both work—ensure consistent per-page DPI and metadata. If your PDF mixes resolutions, rasterize per page at a target x-height rather than one global DPI.

10) Metadata and Provenance

Keep EXIF/XMP or PDF metadata when possible: DPI, orientation, capture device, and software chain. Provenance accelerates triage when results regress. If privacy rules restrict metadata, retain a minimal manifest (hash, approximate DPI, color model) alongside exports.

11) Fonts, Hinting, and Small Text

Pixel-aligned system fonts in screenshots may show color fringing from subpixel rendering. Convert to grayscale (or desaturate) before OCR to remove subpixel artifacts. For tiny UI text, aim for an x-height near 20–24 px after scaling; below ~14 px, recognition quality drops steeply.

12) Tables, Lines, and Halftones

Halftone photographs inside documents confuse binarizers. Use selective processing: mask or soften halftones, but keep edges around text and table rules crisp. For tables, run a mild morphological open/close pass to solidify broken grid lines before segmentation.
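The open/close pass mentioned above is built from dilation and erosion; a minimal binary sketch with a 3×3 neighborhood, where closing (dilate then erode) bridges one-pixel breaks in a table rule:

```python
def _get(img, y, x):
    """Pixel with zero ('off') padding outside the image."""
    return img[y][x] if 0 <= y < len(img) and 0 <= x < len(img[0]) else 0

def dilate(img):
    """3x3 binary dilation: a pixel turns on if any neighbor is on."""
    h, w = len(img), len(img[0])
    return [[1 if any(_get(img, y + dy, x + dx)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def erode(img):
    """3x3 binary erosion: a pixel stays on only if every neighbor is on."""
    h, w = len(img), len(img[0])
    return [[1 if all(_get(img, y + dy, x + dx)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def close_gaps(img):
    """Morphological closing: bridge small breaks in grid lines."""
    return erode(dilate(img))
```

Opening (erode then dilate) is the dual operation and is what softens isolated halftone dots.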

13) Web-Friendly Rendering Paths

In the browser, prefer <canvas> with OffscreenCanvas/Web Workers for heavy steps. Tile large pages (e.g., 1024×1024) to keep memory bounded; render vector PDFs to tiles and stitch. Avoid repeated encode/decode cycles—hold intermediate bitmaps in memory and compress once on export.
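The tiling step can be sketched as a generator of (x, y, w, h) rectangles that partition the page; edge tiles simply come out smaller:

```python
def tiles(width: int, height: int, tile: int = 1024):
    """Yield (x, y, w, h) rectangles covering a width x height page
    with non-overlapping tiles of at most tile x tile pixels."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(tile, width - x), min(tile, height - y))
```

Because the tiles never overlap and cover every pixel exactly once, stitched results need no blending, only placement.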

14) Locale and Color Conventions

Locale affects more than text: currency marks, decimal separators, date glyphs, and even color conventions in highlighted text. Keep color-to-mono conversions consistent so emphasized tokens (e.g., red warnings) do not vanish into the background.
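As one concrete locale hazard, a recognized number token can be normalized to a canonical decimal point. This is a hypothetical helper, and it assumes the simplest case: ',' and '.' trading roles between locales, with no other separators involved:

```python
def normalize_decimal(token: str, locale_decimal: str = ",") -> str:
    """Rewrite a recognized number token to use '.' as the decimal
    separator, dropping the grouping separator. Assumes the grouping
    separator is whichever of ',' and '.' is not the decimal separator."""
    group = "." if locale_decimal == "," else ","
    return token.replace(group, "").replace(locale_decimal, ".")

# normalize_decimal("1.234,56") -> "1234.56"
```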

15) Safety and Integrity Notes

Validate inputs before heavy processing: reject malformed containers, cap decoded dimensions to guard against decompression bombs, and avoid executing any active content embedded in PDFs. Checksum files at ingest so later stages can detect corruption and every result remains traceable to its source.

16) Export Targets for Downstream Use

Choose export by workflow: plain text for quick copy; JSON with region coordinates for automation; CSV for tables; and ALTO XML or hOCR when you need reading order and layout semantics. Always include page DPI, color model, and a per-page checksum in the export to make debugging reproducible.
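A minimal JSON export with those provenance fields might look like this; the schema shown is illustrative, not a fixed format:

```python
import hashlib
import json

def export_page_json(words, dpi, color_model, page_bytes):
    """Serialize one page of OCR output with provenance fields.
    `words` is a list of (text, (x, y, w, h)) tuples; the field names
    and layout here are an illustrative sketch."""
    payload = {
        "dpi": dpi,
        "color_model": color_model,
        "page_sha256": hashlib.sha256(page_bytes).hexdigest(),
        "words": [
            {"text": t, "bbox": {"x": x, "y": y, "w": w, "h": h}}
            for t, (x, y, w, h) in words
        ],
    }
    return json.dumps(payload, indent=2, ensure_ascii=False)
```

The per-page checksum is computed over the page image bytes, so a regression report can state exactly which input produced which output.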

17) Practical Checklists

Before upload:

- Source is lossless or only lightly compressed, with no repeated re-encode cycles
- Resolution puts the line x-height near 20–30 px
- Grayscale for black-text documents; PNG or 4:4:4 JPEG for colored text
- Pages deskewed, oriented correctly, transparency flattened

On export:

- Format matches the workflow (plain text, JSON, CSV, ALTO XML, or hOCR)
- Page DPI, color model, and checksums are recorded for provenance

Summary

Format, container, and color decisions directly shape OCR quality. Prefer lossless or lightly compressed inputs for text; honor ICC profiles; rasterize vectors at a target x-height; and export with metadata that preserves provenance. These habits keep edges clean, colors predictable, and pages faithful—so recognition is both accurate and repeatable.