Set default column count

Question

I've got a large number of 3- and 6-column journal and newspaper pages that I want to OCR. I want to automate recognition of columns.

I've used tesseract (see a previous question) and Google Cloud Document AI (using the R package daiR) without great success.

These programs read the text very well, but do not do a good job of recognizing the column format of pages.

Here are a couple of examples from daiR:

Obviously these are complex images with some double columns and some tables inside columns. What I want is for the OCR to try to look for 6 columns.

I get good results if I preprocess images (for instance by cropping them into single columns or adding vertical lines), but I haven't found an efficient way to do this in large batches. Is there a way of preprocessing images or telling OCR programs to look for a given number of columns?
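For reference, the cropping preprocessing could be batched along these lines. This is only a minimal sketch, assuming the columns are roughly equal in width and the gutters between them carry less ink than the text. The function names `find_gutters` and `column_boxes` and the `search_frac` parameter are made up for illustration. The `profile` list (count of dark pixels at each x position) would have to be computed from each thresholded page image with an image library, which is not shown here.

```python
def find_gutters(profile, n_columns, search_frac=0.25):
    """Return n_columns - 1 x-positions where the page is likely split.

    profile[x] is the amount of ink (dark pixels) in pixel column x.
    Each gutter is searched for near its evenly spaced expected position.
    """
    width = len(profile)
    col_width = width / n_columns
    # Hypothetical tuning knob: how far from the expected position to look.
    radius = max(1, int(col_width * search_frac))
    gutters = []
    for k in range(1, n_columns):
        expected = round(k * col_width)
        lo = max(0, expected - radius)
        hi = min(width, expected + radius + 1)
        # Least-inked x in the window; ties go to the x closest to expected.
        best = min(range(lo, hi), key=lambda x: (profile[x], abs(x - expected)))
        gutters.append(best)
    return gutters


def column_boxes(width, height, gutters):
    """Turn gutter positions into (left, top, right, bottom) crop boxes."""
    edges = [0] + list(gutters) + [width]
    return [(edges[i], 0, edges[i + 1], height) for i in range(len(edges) - 1)]


# Toy profile: a 60 px wide "page" with blank gutters at x=19 and x=41.
profile = [5] * 60
profile[19] = 0
profile[41] = 0

gutters = find_gutters(profile, n_columns=3)
boxes = column_boxes(len(profile), 100, gutters)
print(gutters)
print(boxes)
```

The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so each one could be cropped and sent to the OCR engine as a single-column image. Pages like the examples, with double-width columns or tables inside columns, would break the equal-width assumption and need separate handling.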

can you please post the **original input images** before passing them to *daiR* — Bilal, Jul 06 '21 at 21:49
Yes, they are on github: [p4](https://github.com/dig-eg-gaz/page-images/raw/master/1893-page-images-1/1893-01-02-p4.jpg) and [p3](https://github.com/dig-eg-gaz/page-images/raw/master/1893-page-images-1/1893-01-02-p3.jpg) — Will Hanley, Jul 07 '21 at 14:25
FYI, Document AI has an actively monitored tag [`[cloud-document-ai]`](https://stackoverflow.com/questions/tagged/cloud-document-ai) --- Document AI doesn't support searching for a specific number of columns. — Holt Skinner, Mar 28 '23 at 22:01
