I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.
Following suggestions on other questions, I've tried playing with --psm and hocr without great success.
Working with a JPG I've posted on GitHub, I converted it into a text-embedded PDF with this command:
tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf
I get this result:
Clearly the engine is making one block containing the indented lines and another containing the flush lines.
The text output for the flush lines confirms this:
Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.
to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.
application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——
Is there a way to constrain Tesseract to certain column boundaries? (Obviously I could do this by cutting up the images, but I'd like to avoid that work.)
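For reference, the cutting-up fallback mentioned above could be sketched roughly as follows. This is only an illustration under a strong assumption: that the page splits into six equally wide columns, which may not hold for these scans. The helper names column_boxes and tesseract_command are hypothetical, not part of Tesseract. The only Tesseract-specific detail is --psm 6, which tells the engine to treat the image as a single uniform block of text.

```python
def column_boxes(page_width, page_height, n_columns=6, overlap=0):
    """Return (left, top, right, bottom) pixel boxes for equal-width columns.

    Assumes the columns are equally wide and span the full page height.
    `overlap` widens each box by a few pixels on both sides so that text
    touching a column edge is not clipped.
    """
    boxes = []
    for i in range(n_columns):
        left = max(0, i * page_width // n_columns - overlap)
        right = min(page_width, (i + 1) * page_width // n_columns + overlap)
        boxes.append((left, 0, right, page_height))
    return boxes


def tesseract_command(crop_path, out_base, lang="eng+fra"):
    """Build the argument list for running Tesseract on one cropped column."""
    # --psm 6: assume a single uniform block of text
    return ["tesseract", crop_path, out_base, "-l", lang, "--psm", "6"]


if __name__ == "__main__":
    # Example with a made-up page size; real dimensions come from the scan.
    for n, box in enumerate(column_boxes(6000, 9000, overlap=20), start=1):
        print(n, box, " ".join(tesseract_command(f"col{n}.png", f"col{n}")))
```

Each box uses the (left, upper, right, lower) order that image libraries such as Pillow expect for cropping, so the crops could be written out and passed to the generated commands one column at a time. That is exactly the per-column work I'd like Tesseract to do for me.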