I have a single process that I run using the subprocess module's Popen:
import subprocess
from time import time, sleep

result = subprocess.Popen(['tesseract', 'mypic.png', 'myop'])
st = time()
# Poll until the child process exits.
while result.poll() is None:
    sleep(0.001)
en = time()
print('Took :' + str(en - st))
Which results in:
Took :0.44703030586242676
Here, a tesseract call is made to process the image mypic.png (attached) and write the OCR result to myop.txt.
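For comparison, the same measurement can be sketched with Popen.wait(), which blocks until the child exits instead of polling in a loop. This is only an illustration: time_command is a hypothetical helper name, and the child process here is a stand-in Python interpreter rather than tesseract, so the snippet runs without the image.

```python
import subprocess
import sys
from time import perf_counter


def time_command(cmd):
    """Run cmd, block until it exits, and return (exit_code, seconds)."""
    start = perf_counter()
    proc = subprocess.Popen(cmd)
    exit_code = proc.wait()  # blocks until the child exits, no polling loop
    return exit_code, perf_counter() - start


# Stand-in child process (assumption); for the real case the command
# would be ['tesseract', 'mypic.png', 'myop'].
code, seconds = time_command([sys.executable, '-c', 'pass'])
print('Took :' + str(seconds))
```

The elapsed time depends on the machine, so no expected output is shown.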
Now I want to run this in multiple processes, as suggested in this comment (or see this directly), so here is the code:
lst = []
for i in range(4):
    lst.append(subprocess.Popen(['tesseract', 'mypic.png', 'myop' + str(i)]))

i = 0
l = len(lst)
val = 0  # meant to hold one bit per completed process
while val != (1 << l) - 1:
    if lst[i].poll() is None:
        print('Waiting for :' + str(i))
        sleep(0.01)
    else:
        temp = val
        val = val or (1 << i)
        if val != temp:
            print('Completed for :' + str(i))
    i = (i + 1) % l  # move on to the next process, wrapping around
What this code does is make 4 calls to tesseract, save the process objects in a list lst, and iterate over these objects until all of them have completed. The implementation of the infinite loop is explained at the bottom.
The problem here is that the latter program takes an enormous amount of time to complete. It keeps waiting for the processes to finish using the poll() function, which returns None until the process has completed. This should not happen: it should take only a little more than 0.44 s, not something like 10 minutes! Why is this happening?
I came across this specific problem while digging into pytesseract, which was taking a lot of time when run in parallel using multiprocessing or pathos. So this is a scaled-down version of a much bigger issue. My question on that can be found here.
Explanation for the infinite loop:
val is 0 initially. It is ORed with 2^i when the ith process completes. So, if there are 3 processes and the first process (i=0) completes, then 2^0 = 1 is ORed with val, making it 1. Once the second and third processes have also completed, val becomes 2^0 | 2^1 | 2^2 = 7. And 2^3-1 is also 7. So the loop runs until val equals 2^{number of processes}-1.
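The bookkeeping described above can be sketched in isolation, without any subprocesses. This is a minimal illustration rather than the code from my script: mark_done and all_done_mask are hypothetical helper names, and the OR is written with Python's bitwise | operator.

```python
def all_done_mask(n):
    """Return the value with the lowest n bits set, i.e. 2**n - 1."""
    return (1 << n) - 1


def mark_done(val, i):
    """Set bit i of val using bitwise OR."""
    return val | (1 << i)


val = 0
for i in range(3):
    val = mark_done(val, i)
    print(i, bin(val))
# prints:
# 0 0b1
# 1 0b11
# 2 0b111

print(val == all_done_mask(3))  # prints True, since 7 == 2**3 - 1
```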