I've got a large number of 3- and 6-column journal and newspaper pages that I want to OCR. I want to automate recognition of columns.
I've used Tesseract (see a previous question) and Google Cloud Document AI (using the R package daiR) without great success.
These programs read the text very well, but do not do a good job of recognizing the column format of pages.
Here are a couple of examples from daiR:
Obviously these are complex images with some double columns and some tables inside columns. What I want is for the OCR to try to look for 6 columns.
I get good results if I preprocess images (for instance by cropping them into single columns or adding vertical lines), but I haven't found an efficient way to do this in large batches. Is there a way of preprocessing images or telling OCR programs to look for a given number of columns?
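To make the kind of batch preprocessing I have in mind concrete, here is a minimal sketch in plain Python. It assumes each page has already been binarised into rows of 0/1 pixels (1 = ink), and that the columns are roughly equal in width. The function names (`ink_profile`, `find_gutters`, `column_boxes`) and the `search_frac` parameter are made up for illustration and do not come from any library. The idea is to sum the ink in each pixel column (a vertical projection profile) and then, near each expected column boundary, pick the x position with the least ink as the cut.

```python
from typing import List, Sequence, Tuple


def ink_profile(rows: Sequence[Sequence[int]]) -> List[int]:
    """Vertical projection profile: amount of ink in each pixel column.

    `rows` is a binarised page, one sequence of 0/1 values per pixel row.
    """
    return [sum(col) for col in zip(*rows)]


def find_gutters(profile: Sequence[int], n_columns: int,
                 search_frac: float = 0.25) -> List[int]:
    """Return n_columns - 1 cut positions (x coordinates).

    For each boundary expected at k * width / n_columns, search a window of
    +/- search_frac * column_width and keep the x with the least ink,
    preferring the position closest to the expected boundary on ties.
    """
    width = len(profile)
    col_w = width / n_columns
    radius = max(1, int(col_w * search_frac))
    cuts = []
    for k in range(1, n_columns):
        centre = round(k * col_w)
        lo = max(0, centre - radius)
        hi = min(width - 1, centre + radius)
        best = min(range(lo, hi + 1),
                   key=lambda x: (profile[x], abs(x - centre)))
        cuts.append(best)
    return cuts


def column_boxes(width: int, height: int,
                 cuts: Sequence[int]) -> List[Tuple[int, int, int, int]]:
    """Turn cut positions into (left, top, right, bottom) crop boxes."""
    edges = [0, *cuts, width]
    return [(left, 0, right, height) for left, right in zip(edges, edges[1:])]


if __name__ == "__main__":
    # Toy 60-pixel-wide "page" with 3 columns and two blank gutters.
    profile = [10] * 60
    for x in (19, 20, 21, 41, 42):
        profile[x] = 0
    cuts = find_gutters(profile, n_columns=3)
    print(cuts)
    print(column_boxes(60, 100, cuts))
```

The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so each one could in principle be cropped out and sent to the OCR engine as a single column. Headlines or tables that span a gutter add ink to the profile, which is why the sketch only looks for the least-inked position near each expected boundary rather than requiring a completely blank strip.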