I'm currently using ImageMagick and Tesseract to OCR a PDF. The input file is a table where the headers contain black-on-white text and the rows are represented with white-on-black text:

[Image: table headers]

My issue is that Tesseract does great on the black-on-white text but has no idea what to do with the white-on-black text. It treats the black in the image above as the text and the white as whitespace, so it just reads it in as a string of gibberish.

The answer seems to be to pre-process the image to invert all the text where there's a black background and white text.
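To make the idea concrete, here is a minimal sketch of selective inversion. It is not the algorithm from the paper or the linked repository, just a simple heuristic I'm assuming for illustration: treat the page as a grid of grayscale values (0 = black, 255 = white) and invert only the pixel rows that are mostly dark. The function name `invert_dark_bands` and the parameters `threshold` and `dark_fraction` are hypothetical, and loading the page into such a grid is left to whatever image library is in use.

```python
def invert_dark_bands(gray, threshold=128, dark_fraction=0.5):
    """Invert each pixel row that is mostly dark; leave other rows untouched.

    gray: list of rows, each row a list of ints in 0..255 (0 = black).
    threshold: values below this count as "dark" (assumed cutoff).
    dark_fraction: a row is inverted when more than this share of it is dark.
    """
    result = []
    for row in gray:
        if not row:
            # Empty rows have nothing to classify or invert.
            result.append([])
            continue
        dark = sum(1 for p in row if p < threshold)
        if dark / len(row) > dark_fraction:
            # Mostly dark: assume white-on-black text and flip it.
            result.append([255 - p for p in row])
        else:
            # Mostly light: assume black-on-white text and keep it.
            result.append(list(row))
    return result


# Tiny example: the first row is mostly white, the second mostly black.
page = [
    [255, 255, 0, 255],
    [0, 0, 255, 0],
]
normalized = invert_dark_bands(page)
```

This only works when the dark bands span the full width of the page, as table rows usually do. A page with dark blocks in just part of a row would need a per-region decision instead of a per-row one.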

There's supposedly a paper that tackled this problem (see the answer to "Detect white characters on black background using Tesseract"), and this is an implementation of its algorithm: https://github.com/jasonlfunk/ocr-text-extraction

While the implementation linked above does a great job of inverting the table headers, it also erroneously inverts blocks of white background in the rest of the page, including the black-on-white text. Does anyone know whether this problem has been solved, or whether a workaround has been found, since that paper was published several years ago?

nao