Preprocessing images for OCR with Tesseract: Distinguishing between black on white and white on black text

Question

I'm currently using ImageMagick and Tesseract to OCR a PDF. The input file is a table where the headers contain black on white text and the rows are represented with white on black text:

My issue is that Tesseract does great on the black on white text but has no idea what to do with the white on black text. It thinks that the black in the image above is the text and the white is whitespace, so it just reads it in as a string of gibberish.

The answer seems to be to pre-process the image to invert all the text where there's a black background and white text.
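To make that idea concrete, here is a minimal sketch of one possible heuristic, not the algorithm from any paper or library: split the grayscale page into tiles and invert only the tiles that are mostly dark, on the assumption that text covers a minority of the pixels in any region. The function name `invert_dark_tiles` and the parameters `tile` and `threshold` are hypothetical, and the image is assumed to be already loaded as a list of rows of 0..255 values.

```python
def invert_dark_tiles(gray, tile=32, threshold=128):
    """Return a copy of `gray` with mostly-dark tiles inverted.

    `gray` is a list of rows, each a list of ints in 0..255.
    `tile` and `threshold` are illustrative defaults, not tuned values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [row[:] for row in gray]  # copy so the input is left untouched

    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)

            # Count pixels darker than the threshold in this tile.
            dark = sum(
                1
                for y in range(top, bottom)
                for x in range(left, right)
                if gray[y][x] < threshold
            )
            area = (bottom - top) * (right - left)

            # Assumption: text covers a minority of pixels, so a
            # mostly-dark tile is light text on a dark background.
            if dark * 2 > area:
                for y in range(top, bottom):
                    for x in range(left, right):
                        out[y][x] = 255 - gray[y][x]
    return out
```

Tiles that straddle the border between a dark row and a light one can still be misclassified, so the tile size would probably need tuning against the row height of the actual table. On full-size page scans the same logic would likely be faster when expressed with array operations.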

There's supposedly a paper that tackled this problem (see answer to Detect white characters on black background using Tesseract) that produced this implementation of their algorithm: https://github.com/jasonlfunk/ocr-text-extraction

While the implementation linked above does a great job of inverting the table headers, it also erroneously inverts blocks of white background in the rest of the page, including the black on white text. Does anyone know whether this problem has been tackled, or a workaround found, since that paper was published several years ago?

Please provide your full image so we can see what can be done with it. What version of ImageMagick and on what platform? — fmw42, Jan 19 '19 at 00:49
@nao, Did you find the solution? If so, I'd like you to share the solution. — Gary Chen, Apr 26 '21 at 14:36
