I've got a large number of 3- and 6-column journal and newspaper pages that I want to OCR. I want to automate recognition of columns.

I've used Tesseract (see a previous question) and Google Cloud Document AI (using the R package daiR) without great success.

These programs read the text very well, but do not do a good job of recognizing the column format of pages.

Here are a couple of examples from daiR:

[image of text blocks on newspaper]

Obviously these are complex images with some double columns and some tables inside columns. What I want is for the OCR to try to look for 6 columns.

I get good results if I preprocess images (for instance by cropping them into single columns or adding vertical lines), but I haven't found an efficient way to do this in large batches. Is there a way of preprocessing images or telling OCR programs to look for a given number of columns?
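To make the cropping idea concrete, here is a minimal sketch of one way it might be batched, assuming the page is already deskewed, available as rows of grayscale pixel values, and that the columns are roughly equal in width. It counts dark pixels per x position and, near each expected gutter, cuts where there is the least ink. The function names (`ink_profile`, `find_column_cuts`, `column_boxes`) and the `threshold` and `search_frac` parameters are hypothetical, not part of Tesseract, Document AI, or daiR.

```python
from typing import List, Sequence, Tuple


def ink_profile(rows: Sequence[Sequence[int]], threshold: int = 128) -> List[int]:
    """Count dark pixels (value < threshold) at each x position of a grayscale page."""
    width = len(rows[0])
    profile = [0] * width
    for row in rows:
        for x, value in enumerate(row):
            if value < threshold:
                profile[x] += 1
    return profile


def find_column_cuts(
    profile: Sequence[int], n_columns: int, search_frac: float = 0.25
) -> List[int]:
    """Pick one cut per expected gutter: the least-inked x near each k/n of the width.

    Assumption: columns are roughly equal in width, so each gutter lies within
    search_frac of a column width from its ideal position.
    """
    width = len(profile)
    col_width = width / n_columns
    radius = max(1, int(col_width * search_frac))
    cuts = []
    for k in range(1, n_columns):
        centre = round(k * col_width)
        lo = max(0, centre - radius)
        hi = min(width - 1, centre + radius)
        # Least ink wins; ties are broken towards the expected position.
        best = min(range(lo, hi + 1), key=lambda x: (profile[x], abs(x - centre)))
        cuts.append(best)
    return cuts


def column_boxes(width: int, cuts: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn cut positions into (left, right) x ranges, one per column."""
    edges = [0, *cuts, width]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # Synthetic 30-pixel-wide "page": all ink except two blank gutters at x=9 and x=21.
    page = [[255 if x in (9, 21) else 0 for x in range(30)] for _ in range(5)]
    cuts = find_column_cuts(ink_profile(page), n_columns=3)
    print(column_boxes(30, cuts))
```

On real scans the pixel rows would come from an image library, and each `(left, right)` pair would become a crop box passed to the OCR step one column at a time. Headlines, tables, or double-width articles that span a gutter would likely need the profile computed per horizontal band rather than over the whole page.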

double-beep
Will Hanley
  • Can you please post the **original input images**, as they were before being passed to *daiR*? – Bilal Jul 06 '21 at 21:49
  • Yes, they are on GitHub: [p4](https://github.com/dig-eg-gaz/page-images/raw/master/1893-page-images-1/1893-01-02-p4.jpg) and [p3](https://github.com/dig-eg-gaz/page-images/raw/master/1893-page-images-1/1893-01-02-p3.jpg) – Will Hanley Jul 07 '21 at 14:25
  • FYI, Document AI has an actively monitored tag [`[cloud-document-ai]`](https://stackoverflow.com/questions/tagged/cloud-document-ai). Note that Document AI doesn't support searching for a specific number of columns. – Holt Skinner Mar 28 '23 at 22:01

0 Answers