I am not sure whether it is my infrastructure that causes this weird behavior or tesseract-ocr itself.

Whenever I use image_to_string in a single-process environment, tesseract-ocr works fine. But when I spawn multiple workers with gunicorn and all of them do OCR work, tesseract-ocr starts reading very poorly (not performance-wise, but accuracy-wise). Even after the load is over, tesseract never returns to the same accuracy. I need to restart all the workers to get tesseract working well again.

This is super weird. Has anyone experienced or heard of this issue?

Laimonas Sutkus

1 Answer

(Note: the information below is based on a review of the pytesseract.py code; I haven't set up a multi-process test to check it.)

There are several Python libraries that interface with tesseract-ocr. You are probably using pytesseract (guessing by the image_to_string function).

This library calls the tesseract-ocr binary as a subprocess and uses temporary files to interface with it. It uses the deprecated tempfile.mktemp(), which does not guarantee unique file names. Worse, it does not even use the returned file name as-is, so a second call to tempfile.mktemp() can easily return the same file name. If two workers end up sharing a name, one could read or overwrite the other's files, which would be consistent with the degraded results you describe.
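
To illustrate why tempfile.mktemp() is risky, here is a minimal sketch using only the standard library. It is not taken from pytesseract; the `.png` suffix and the variable names are assumptions chosen for the example. It shows that mktemp() only returns a name without creating anything, while mkstemp() creates the file at the same moment it picks the name.

```python
import os
import tempfile

# mktemp() only picks a name. Nothing is created on disk, so another
# process can choose or create the same path before this one uses it.
name_only = tempfile.mktemp(suffix=".png")
name_exists = os.path.exists(name_only)

# mkstemp() creates and opens the file in one step, so the path is
# already reserved for this process when the call returns.
fd, safe_path = tempfile.mkstemp(suffix=".png")
safe_exists = os.path.exists(safe_path)

# Clean up the file created by mkstemp().
os.close(fd)
os.remove(safe_path)
```

The gap between choosing a name and creating the file is where concurrent workers could collide.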

Consider using a different Python interface library for tesseract, e.g. pip install tesseract-ocr, or python-tesseract from Google (https://code.google.com/archive/p/python-tesseract/).

If the problem is actually with the temp files, as I suspect, you may be able to work around it by setting a different temp directory for each of your spawned worker processes:

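import shutil
import tempfile

# Give this worker its own private temp directory.
td = tempfile.mkdtemp()
tempfile.tempdir = td
try:
    # Your code calling pytesseract.image_to_string() or similar goes here.
    pass
finally:
    # Restore the default temp location, then delete the directory and any leftovers.
    tempfile.tempdir = None
    shutil.rmtree(td, ignore_errors=True)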
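
The same workaround could also be packaged as a reusable context manager, so that each worker wraps its OCR calls in one `with` block. This is only a sketch under the same untested assumption that temp file collisions are the cause; `private_tempdir` and the `ocr-worker-` prefix are hypothetical names introduced for illustration.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def private_tempdir():
    """Point the tempfile module at a fresh directory for the duration of the block."""
    previous = tempfile.tempdir
    # TemporaryDirectory removes the directory and its contents on exit.
    with tempfile.TemporaryDirectory(prefix="ocr-worker-") as td:
        tempfile.tempdir = td
        try:
            yield td
        finally:
            # Restore whatever setting was in place before the block.
            tempfile.tempdir = previous


# Usage: temp files created via the tempfile module inside the block
# land in this process's own directory.
with private_tempdir() as td:
    inside = tempfile.gettempdir()
    # Call pytesseract.image_to_string() or similar here.
```

Because tempfile.tempdir is a per-process global, this only isolates whole processes from each other, not threads inside one worker.
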
Leo K
  • How about tesserocr 2.3.1? – Laimonas Sutkus Aug 28 '18 at 09:16
  • I am not familiar with tesserocr, but it appears to use a C interface directly to the library, so it should be both more efficient and safe from temporary-file issues. – Leo K Aug 28 '18 at 10:52
  • Is `tesserocr` what you are using, or an alternative that you are considering? (BTW: it would be nice to update the question to specify the exact Python library that exhibits the issue, to help others who are searching for solutions.) – Leo K Aug 28 '18 at 10:54