
I need to implement a multiprocessing pool that utilizes arbitrary packages for calculations. For this, I'm using Python and joblib 0.9.0. This code is basically the structure I want.

import numpy as np
from joblib import pool

def someComputation(x):
    return np.interp(x, [-1, 1], [-1, 1])

if __name__ == '__main__':
    some_set_of_numbers = [-1,-0.5,0,0.5,1]
    the_pool = pool.Pool(processes=2)
    solutions = [the_pool.apply_async(someComputation, (x,)) for x in some_set_of_numbers]
    print(solutions[0].get())

On both Windows 10 and Red Hat Enterprise Linux running Anaconda 4.3.1 with Python 3.6.0 (as well as 3.5 and 3.4 in virtual envs), the name 'np' is not defined inside someComputation(), raising the error

File "C:\Anaconda3\lib\site-packages\multiprocessing_on_dill\pool.py", line 608, in get
    raise self._value
NameError: name 'np' is not defined

However, on my Mac OS X 10.11.6 machine running Python 3.5 and the same joblib version, I get the expected output of '-1' with the exact same code. This question is essentially the same, but it dealt with pathos rather than joblib. The general answer was to move the numpy import statement inside the function:

from joblib import pool

def someComputation(x):
    import numpy as np
    return np.interp(x, [-1, 1], [-1, 1])

if __name__ == '__main__':
    some_set_of_numbers = [-1,-0.5,0,0.5,1]
    the_pool = pool.Pool(processes=2)
    solutions = [the_pool.apply_async(someComputation, (x,)) for x in some_set_of_numbers]
    print(solutions[0].get())

This solves the issue on the Windows and Linux machines, which now output '-1' as expected, but the solution seems clunky. Is there any reason why the first bit of code would work on a Mac, but not on Windows or Linux? I ultimately need to run this code on the Linux machine, so is there any fix that doesn't involve putting the import statement inside the function?

Edit:

After investigating a bit further, I found an old workaround I put in years ago that looks like it is causing the issue. In joblib/pool.py, I changed line 44 from

from multiprocessing.pool import Pool

to

from multiprocessing_on_dill.pool import Pool

to support pickling of arbitrary functions. For some reason, this change is what really causes the issue on Windows and Linux, but the Mac machine runs just fine. Using multiprocessing instead of multiprocessing_on_dill solves the above issue, but the code doesn't work for the majority of my cases since they can't be pickled.
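A plausible explanation (sketched below with the plain standard library; `compute` is a hypothetical stand-in for someComputation) is that stdlib pickle serializes functions *by reference*, i.e. as a module name plus qualified name, so a worker process re-imports the defining module and all of its globals, including numpy. A dill-based pool may instead serialize the function *by value*, in which case the reconstructed function does not necessarily carry its module's globals with it:

```python
import math
import pickle

def compute(x):
    # the body looks up 'math' in this module's globals at call time
    return math.sqrt(x)

# pickle stores only a reference ("module.qualname"); unpickling re-imports
# the module, so globals such as 'math' exist on the receiving side
payload = pickle.dumps(compute)
restored = pickle.loads(payload)
print(restored(4.0))  # 2.0
```

This is also why stdlib pickle refuses lambdas and locally defined functions (they have no importable reference), which is exactly the gap dill fills, at the cost of the by-value behaviour seen above.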

  • Not sure if this answers your question, but a better way than putting the import into the function would be to declare it as `def someComputation(x, np=np):`. That should bind the module to a local name within the function when it is first interpreted, avoiding the need to run the import machinery every time. – Mad Physicist May 09 '17 at 16:23
  • That works great and I can certainly use it until we find out the real issue. – Michael Sparapany May 09 '17 at 16:45

1 Answer


I am not sure what the exact issue is, but it appears that there is some problem with transferring the global scope over to the subprocesses that run the task. You can potentially avoid name errors by binding the name np as a function parameter:

def someComputation(x, np=np):
    return np.interp(x, [-1, 1], [-1, 1])

This has the advantage of not invoking the import machinery every time the function runs: default argument values are evaluated once, when the `def` statement is executed at module load time, so `np` is captured then and travels with the function as a local parameter.
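To illustrate the binding behaviour with a minimal stdlib sketch (the names here are illustrative, not from the question): because the default is captured when `def` executes, the function keeps working even if the module-level name later disappears.

```python
import math as _math

def root(x, math=_math):
    # 'math' is a local parameter whose default was captured at def time
    return math.sqrt(x)

del _math  # the module-level name is gone, but the default still holds a reference
print(root(9.0))  # 3.0
```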

Mad Physicist