
I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.

Following suggestions on other questions, I've tried playing with --psm and hocr without great success.

Working with a jpg I've posted on github, and converting it into a text-embedded pdf using the command `tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf`, I get this result:

[Image: the resulting PDF]

Clearly the engine is making one block containing the indented lines, and another containing the flush lines.

Confirming this is the text output of the flush lines:


Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.

to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.

application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——

Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)

Will Hanley
  • Have you tried a different PSM mode? I think you should try `--psm 6`. – maulik kansara Aug 14 '19 at 09:51
  • `--psm 6` is worse--it reads single lines across all six columns. :( – Will Hanley Aug 14 '19 at 19:33
  • Oh. If you have a fixed page design, then you can scan each column using UZN files with coordinates. – maulik kansara Aug 16 '19 at 06:48
  • For anyone who might be interested: you can make a big improvement in column recognition if you use a paint/photo editing program to draw straight black lines between every column on the source image. – Will Hanley Sep 16 '19 at 20:00
  • Will Hanley, this is very curious. Are you still working in this area? I am trying to join some data set examples for comparisons (I am trying pytesseract). Have you done something in this area? – JJoao Oct 18 '22 at 07:35
  • @JJoao I have not returned to tesseract recently--will give it another shot with pytesseract and see if column treatment has developed. What do you mean by "join some data set examples"? – Will Hanley Oct 19 '22 at 12:44
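
The UZN suggestion in the comments can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes each UZN line has the form `left top width height label`, that Tesseract picks up a `.uzn` file sharing the image's base name when run in a non-automatic page segmentation mode such as `--psm 4`, and that the six columns are of equal width. The function names and the page dimensions are hypothetical; real column boundaries would have to be measured from the scan.

```python
def column_zones(page_width, page_height, n_columns=6, margin=0):
    """Split the page into n equal-width, full-height zones.

    Returns a list of (left, top, width, height) tuples in pixels.
    Equal column widths are an assumption; adjust for the real layout.
    """
    usable = page_width - 2 * margin
    zones = []
    for i in range(n_columns):
        left = margin + (usable * i) // n_columns
        right = margin + (usable * (i + 1)) // n_columns
        zones.append((left, 0, right - left, page_height))
    return zones


def uzn_text(zones, label="Text"):
    """Format zones as UZN lines (assumed format: left top width height label)."""
    return "".join(f"{l} {t} {w} {h} {label}\n" for l, t, w, h in zones)


if __name__ == "__main__":
    # Hypothetical page size in pixels; the output would be saved next to
    # the image as 1906-07-02-p4.uzn.
    print(uzn_text(column_zones(6000, 9000)), end="")
```

If the zones are honoured, each column should be recognised as its own block, without cutting up the image by hand.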

1 Answer


You can use `--psm 4 --oem 1` or `--psm 4 --oem 3` to get better text and accuracy.
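
For reference, `--psm 4` tells Tesseract to assume a single column of text of variable sizes, `--oem 1` selects the LSTM engine only, and `--oem 3` is the default engine choice. A small sketch of how the suggested flags could be combined with the command from the question is shown below; the helper name `tesseract_command` is hypothetical, and whether these settings actually fix the column bleed on this page is untested here.

```python
import shlex


def tesseract_command(image, output_base, psm=4, oem=1,
                      langs=("eng", "fra"), output_format="pdf"):
    """Build the argument list for a tesseract CLI call (does not run it)."""
    return [
        "tesseract", image, output_base,
        "-l", "+".join(langs),   # languages joined with '+', as in the question
        "--psm", str(psm),       # page segmentation mode
        "--oem", str(oem),       # OCR engine mode
        output_format,           # config name selecting the output type
    ]


if __name__ == "__main__":
    cmd = tesseract_command("1906-07-02-p4.jpg", "out")
    print(shlex.join(cmd))
    # To run it for real (requires tesseract on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```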

kiran beethoju