We have a pipeline running in Google Cloud Platform that:
- extracts crops from a text document image
- processes those crops to ensure they are always black text on white background
- passes the crops to pytesseract to extract the text (a rough sketch of these steps is shown below).
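For context, the preprocessing and OCR step looks roughly like this; the function and variable names are illustrative, not our exact code:

```python
import cv2
import pytesseract


def ocr_crop(crop_bgr):
    """Illustrative crop -> text step (not our exact code)."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu binarisation so the crop ends up as black text on a white background
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # If the crop came out as white text on black, invert it
    if cv2.mean(binary)[0] < 127:
        binary = cv2.bitwise_not(binary)
    # --psm 6: treat the crop as a single uniform block of (multiline) text
    return pytesseract.image_to_string(binary, config="--oem 1 --psm 6").strip()
```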
Most of the time everything works well and the extracted text is correct, but some crops come out wrong.
One example is a multiline crop in the format below, which is often read incorrectly, e.g.:
```
35LURC194-    ->  output as  SSLUBe404-
6                            6
```
(This is a slightly modified instance of the issue, but you get the gist.)
Now, here is where things become weird.
As part of our debugging process, we ran the same code locally, and for every instance where the OCR text is faulty in production (Cloud), the output is correct on the local machine!
The differences between the local and Cloud environments are:
| | Local | Cloud |
|---|---|---|
| Operating System | Arch Linux | Debian Slim Buster Docker image |
| Python version | 3.10.10 | 3.8.6 |
| RAM | 8 GB | 3 GB |
| Environment | Native | Docker container (Cloud Run) |
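For reference, we compare the two environments with roughly this kind of dump (a sketch, not our exact script):

```python
import platform
import sys
from importlib.metadata import version  # Python 3.8+

import pytesseract

# Quick dump we print both locally and inside the Cloud Run container
print("OS:          ", platform.platform())
print("Python:      ", sys.version)
print("pytesseract: ", version("pytesseract"))
print("torch:       ", version("torch"))
print("torchvision: ", version("torchvision"))
print("Tesseract:   ", pytesseract.get_tesseract_version())
```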
Things we've tried so far:
- Ensured the versions of the important packages (pytesseract, torch, torchvision, Tesseract) are the same locally and in production
- Added more RAM and CPU to the Cloud Run instance
- Upgraded the Python version in the container Dockerfile to 3.10.10
- Ensured the cropped image that's being passed to Tesseract is the same in both scenarios (same aspect ratio, looks identical)
- Triple-checked that the code running locally is the same as the code running on Cloud
- Ran Tesseract with different OEM settings and the correct PSM (multiline) in both scenarios (see the sketch after this list)
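The image check and the OEM/PSM sweep from the list above looked roughly like this; the file name and the exact OEM/PSM values here are illustrative:

```python
import hashlib

import cv2
import pytesseract

# Hypothetical file name; in the real pipeline the crop is an in-memory array
crop = cv2.imread("crop.png", cv2.IMREAD_GRAYSCALE)

# Hash the exact pixels Tesseract sees, to compare local vs. Cloud byte for byte
print("crop sha256:", hashlib.sha256(crop.tobytes()).hexdigest())

# Sweep a few OEM/PSM combinations in both environments
for oem in (1, 3):
    for psm in (4, 6, 11):
        try:
            text = pytesseract.image_to_string(crop, config=f"--oem {oem} --psm {psm}")
        except pytesseract.TesseractError as exc:
            text = f"<error: {exc}>"
        print(f"oem={oem} psm={psm}: {text.strip()!r}")
```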
We're running out of ideas on what could be causing this; it's baffling, really. Everything up until the Tesseract processing step is identical in both scenarios, so the issue must be related to Tesseract itself or the environment. And yet everything is the same except the operating system itself.
Would love to hear any ideas on what else we could try, or whether someone else has had a similar experience.