We have a pipeline running in Google Cloud Platform that:

  1. extracts crops from a text document image
  2. processes those crops to ensure they are always black text on white background
  3. passes the crops to pytesseract to extract the text.
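
Step 2 can be sketched as follows. This is only an illustration and not the pipeline's actual code: the function name `ensure_dark_on_light`, the nested-list image representation, and the mean-brightness heuristic with a threshold of 128 are all assumptions.

```python
from typing import List


def ensure_dark_on_light(gray: List[List[int]]) -> List[List[int]]:
    """Invert a non-empty 8-bit grayscale crop if it looks like light text on dark."""
    pixels = [value for row in gray for value in row]
    # Text usually covers a minority of the crop, so the mean brightness
    # is dominated by the background colour (assumed heuristic).
    mean = sum(pixels) / len(pixels)
    if mean < 128:
        # Dark background: flip every pixel so the text ends up dark on light.
        return [[255 - value for value in row] for row in gray]
    # Already dark on light: return an unmodified copy.
    return [row[:] for row in gray]
```

A real pipeline would more likely do this on a NumPy array or a Pillow image, but the decision logic would be the same.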

Most of the time, everything works well and the extracted text is correct, except for some crops.

One example is a multiline crop in the following format, which is often read incorrectly, e.g.:

35LURC194-     -> output as     SSLUBe404-
6                               6

(this is a slightly modified instance of the issue, but you get the gist)

Now, here is where things become weird.

As part of our debugging process, we ran the same code locally, and, for every instance where the OCR text is faulty on production (Cloud), it works accurately on the local machine!

The differences between local and Cloud environment are:

                   Local          Cloud
Operating System   Arch Linux     Debian Slim Buster Docker image
Python version     3.10.10        3.8.6
RAM                8 GB           3 GB
Environment        Native         Docker Container (Cloud Run)

Things we've tried so far:

  • Ensured the versions of the important packages (pytesseract, torch, torchvision, Tesseract) are the same on local and production
  • Added more RAM and CPU to the Cloud Run instance
  • Upgraded the Python version in the container Dockerfile to 3.10.10
  • Ensured the cropped image that's being passed to Tesseract is the same in both scenarios (same aspect ratio, looks the same)
  • Triple-checked that the code running locally is the same as the code running on Cloud
  • Ran Tesseract with different OEM settings and the correct PSM (multiline) in both scenarios
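
A stricter check than "looks the same" is to compare a hash of the pixel data logged in each environment just before the OCR call. The sketch below is a minimal illustration; `crop_fingerprint` is a made-up helper name, and it assumes the crop is available as raw bytes plus its size and mode.

```python
import hashlib


def crop_fingerprint(raw_pixels: bytes, width: int, height: int, mode: str) -> str:
    """Return a SHA-256 hex digest covering the pixel data and its geometry."""
    digest = hashlib.sha256()
    # Include size and mode so that identical bytes with a different
    # shape or channel layout do not produce the same fingerprint.
    digest.update(f"{width}x{height}:{mode}:".encode("ascii"))
    digest.update(raw_pixels)
    return digest.hexdigest()
```

With a Pillow image `img`, this could be called as `crop_fingerprint(img.tobytes(), img.width, img.height, img.mode)`. If the digests match in both environments, the input to Tesseract is identical and the difference has to come from Tesseract or its data files.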

We're running out of ideas on what could be causing this; it's baffling, really. Everything up to the Tesseract processing step is the same in both scenarios, so the issue must lie with Tesseract itself or the environment, and yet everything is the same except the operating system itself.

Would love to hear any ideas on what else we could try, or whether someone else had a similar experience.

ephores
    It seems like the issue is not with the code itself but rather with the environment where it's running. You may want to: check the Tesseract version, check the language data files, check the image preprocessing, try running Tesseract without preprocessing, check the locale settings, try running Tesseract with different configurations, and check for differences in character encoding. – Chanpols Apr 03 '23 at 20:02
    @Chanpols indeed, one of my latest attempts was to run the Docker container locally, and it still produces the wrong output with the exact same test code, so it's definitely the environment. I've played around quite a lot with the package versions and made sure that I have the exact same versions as on my local PC where it's running correctly, but that didn't do the trick. I'll check some more and update if I have anything, thank you! – ephores Apr 04 '23 at 12:03

1 Answer

So in the end it was indeed a version issue: it had to do with the versions of the language data files.

This answer solved it for me: I downloaded the language data files with wget, copied them in the Dockerfile to /usr/share/tesseract-ocr/4.00/tessdata (the directory can vary depending on your OS), and it worked like a charm.

The strange thing is that installing the packages that provide these language files should normally be enough (e.g. apt install tesseract-ocr-eng on Debian), but in my case the packaged versions were not giving me the right outputs.
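
One way to confirm this kind of mismatch is to hash the `.traineddata` files in each environment and compare the results. The following is a minimal sketch, not something taken from the original debugging session; the function names are made up for illustration, and the tessdata directory has to be adjusted to wherever your installation keeps its language files.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tessdata_fingerprints(tessdata_dir: str) -> Dict[str, str]:
    """Map each *.traineddata file name to the SHA-256 of its contents."""
    fingerprints = {}
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        fingerprints[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return fingerprints


def diff_fingerprints(local: Dict[str, str], remote: Dict[str, str]) -> List[str]:
    """Return file names whose hashes differ or that exist on only one side."""
    names = local.keys() | remote.keys()
    return sorted(name for name in names if local.get(name) != remote.get(name))
```

Running `tessdata_fingerprints` on the local machine and inside the container, then passing both results to `diff_fingerprints`, lists every language file that is not byte-identical. A non-empty result for a language you use (for example `eng.traineddata`) means the two environments are running different models even if the Tesseract binary version is the same.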

Note: An important step that helped me find the solution was running the container locally (normally it runs in Cloud Run on GCP); this allowed for much quicker debugging and experimentation.

ephores