
I have a function that takes a list of images and returns a list of OCR results, one per image. Another function controls the input to it using multiprocessing. With a single list (i.e. no multiprocessing), each image took ~1 s, but when I increased the number of lists processed in parallel to 4, each image took an astounding 13 s.

To understand where the problem really lies, I tried to create a minimal working example. Here I have two functions, eat25 and eat100, which open an image by name and feed it to the OCR engine through the pytesseract API. eat25 does this 25 times, and eat100 does it 100 times.

My aim here is to run eat100 without multiprocessing and eat25 with multiprocessing (with 4 processes). Theoretically, the latter should take about a quarter of the time of eat100 if I have 4 separate processors (I have 2 cores with 2 threads per core, so CPU(s) = 4; correct me if I'm wrong here).

But all that theory was laid to waste when the code didn't even respond after printing "Processing 0" four times. The single-process function eat100 worked fine, though.

I had tested a simple range-cubing function, and it worked well with multiprocessing, so my processors definitely work. The only culprits here could be:

  • pytesseract: See this
  • Bad code? Something I am not doing right.

```python
from pathos.multiprocessing import ProcessingPool
from time import time
from PIL import Image
import pytesseract as pt


def eat25(name):
    # 25 OCR passes over the same image.
    for i in range(25):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')


def eat100(name):
    # 100 OCR passes over the same image.
    for i in range(100):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')


# Serial baseline.
st = time()
eat100('normalBox.tiff')
en = time()
print('Direct :' + str(en - st))


# Using pathos: 4 tasks of 25 images each, mapped over the process pool.
def caller():
    pool = ProcessingPool()
    pool.map(eat25, ['normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff'])


if __name__ == '__main__':
    caller()

en2 = time()
print('Pathos :' + str(en2 - en))
```

So, where does the problem really lie? Any help is appreciated!

EDIT: The image normalBox.tiff can be found here. I would be glad if people could reproduce the code and check whether the problem persists.

  • I've noticed you're using the `pathos.multiprocessing` module. Why not use the native `ProcessPoolExecutor` from the standard `concurrent.futures` package? – Samuel Nov 28 '18 at 12:05
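For reference, a minimal sketch of what that suggestion might look like, reusing eat25 and normalBox.tiff from the question; the worker count of 4 is an assumption matching the question's setup, and whether the slowdown reproduces with this pool is untested:

```python
# Sketch only: the standard-library pool suggested in the comment above,
# driving the question's eat25 over the same image four times in parallel.
from concurrent.futures import ProcessPoolExecutor
from time import time

from PIL import Image
import pytesseract as pt


def eat25(name):
    # Same work as in the question: 25 OCR passes over one image.
    for i in range(25):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')


if __name__ == '__main__':
    st = time()
    with ProcessPoolExecutor(max_workers=4) as pool:
        # list() forces the lazy map to finish before the timer stops.
        list(pool.map(eat25, ['normalBox.tiff'] * 4))
    print('ProcessPoolExecutor :' + str(time() - st))
```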

1 Answer


I'm the pathos author. If your code takes 1 s to run serially, then it's quite possible that it will take longer to run in naive process parallel. There is overhead to working with naive process parallel:

  1. a new python instance has to be spun up on each processor
  2. your function and dependencies need to get serialized and sent to each processor
  3. your data needs to get serialized and sent to the processors
  4. the same for deserialization
  5. you can run into memory issues from either long-lived pools or lots of data serialization.

I'd suggest checking a few simple things to see where your issues might be:

  • try the pathos.pools.ThreadPool to use thread parallel instead of process parallel. This can reduce some of the overhead of serialization and of spinning up the pool (a rough sketch of this and the next suggestion follows this list).
  • try the pathos.pools._ProcessPool to change how pathos manages the pool. Without the underscore, pathos keeps the pool around as a singleton and requires a 'terminate' to explicitly kill the pool; with the underscore, the pool dies when you delete the pool object. Note that your caller function does not close or join (or terminate) the pool.
  • you might want to check how much you are serializing by trying dill.dumps on one of the elements you are trying to process in parallel. Things like big numpy arrays can take a while to serialize. If what is being passed around is large, you might consider using a shared-memory array (i.e. a multiprocess.Array, or the equivalent version for numpy arrays -- also see: numpy.ctypeslib) to minimize what is passed between each process (a quick size check is sketched after the next paragraph).
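A rough sketch of the first two suggestions, assuming the question's eat25 and normalBox.tiff are in scope; the pool sizes are illustrative, and on spawn-based platforms this should sit under the usual `if __name__ == '__main__'` guard:

```python
# Sketch only: thread parallel vs. the raw (non-singleton) process pool,
# both cleaned up explicitly with close()/join().
from pathos.pools import ThreadPool, _ProcessPool

names = ['normalBox.tiff'] * 4

# Thread parallel: no new interpreters and no serialization of the
# function or its arguments.
tpool = ThreadPool(nodes=4)
tpool.map(eat25, names)
tpool.close()
tpool.join()

# Raw process pool: dies with the object instead of persisting as a
# singleton, but still close/join it when the work is done.
ppool = _ProcessPool(4)
ppool.map(eat25, names)
ppool.close()
ppool.join()
```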

The shared-memory approach is a bit more work, but it can provide huge savings if you have a lot to serialize. There is no shared-memory pool, so if you need to go that route, you have to loop over individual multiprocess.Process objects.
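And a quick way to eyeball the serialization cost mentioned in the last suggestion; here the payload is just the file-name string from the question, so the numbers should be tiny (eat25 from the question is assumed to be in scope):

```python
# Sketch only: measure how many bytes dill ships per task argument and
# for the function object itself.
import dill

print(len(dill.dumps('normalBox.tiff')), 'bytes per task argument')
print(len(dill.dumps(eat25)), 'bytes for the function')
```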

  • Thanks for answering! I just wanted to tell you that this code runs _perfectly_ fine on Windows. That is the real problem. – Mooncrater Dec 02 '18 at 07:04
  • Plus, regarding your suggestions: the first two behaved the same way as the original. There is a problem in `dill` with `tiff` images, as after pickling and unpickling, the tiff image raises an error when fed to `pytesseract`. It works well with PNGs. – Mooncrater Dec 02 '18 at 07:06
  • I'm also the `dill` author. `dill` doesn't know anything about `tiff` images versus `png` images... so it would most likely be an issue with some object (I assume) `pytesseract` creates to handle the `tiff`. First, let me see if I understand correctly... the code behaves as expected on windows, but hangs on linux/mac? Secondly, can you try `dill.copy` and `dill.check` on the object you are trying to send to the different parallel workers? If both or either throw an error, then you have a serialization issue. You can then use `dill.detect.trace` to diagnose any serialization issues. – Mike McKerns Dec 02 '18 at 11:57
  • Yes, it hangs (works very slowly, to be precise) on Linux. Not tested on Mac, though. – Mooncrater Dec 03 '18 at 10:28
  • In the code that I've provided, I only pass the _name_ (a string) of the file to the parallel workers. So when I check it with `dill.check`, the name of the file is printed. When I use `dill.copy`, nothing is printed and no error is raised. – Mooncrater Dec 03 '18 at 10:54
  • You are mapping the name (a `string`) to the return value of `pytesseract.image_to_string`. So the serialization should only have to deal with whatever object is returned (probably a string, it would seem), and I doubt that's causing the issues. If you are only passing strings in and out of the processes... then the most likely candidate is a memory issue. Are you swapping your memory? – Mike McKerns Dec 04 '18 at 01:58
  • No, I don't think so. – Mooncrater Dec 14 '18 at 14:47