
I'm working on some code that does fairly heavy numerical work on a large set of problems (tens to hundreds of thousands of numerical integrations). Fortunately, these integrations are embarrassingly parallel, so it's easy to use Pool.map() to split up the work across multiple cores.

Right now, I have a program that has this basic workflow:

#!/usr/bin/env python
from multiprocessing import Pool
from scipy import *
from my_parser import parse_numpy_array
from my_project import heavy_computation

#X is a global multidimensional numpy array
X = parse_numpy_array("input.dat")
param_1 = 0.0168
param_2 = 1.505

def do_work(arg):
  return heavy_computation(X, param_1, param_2, arg)

if __name__=='__main__':
  pool = Pool()
  arglist = linspace(0.0,1.0,100)
  results = pool.map(do_work, arglist)
  #save results in a .npy file for analysis
  save("Results", [X,results])

Since X, param_1, and param_2 are hard-coded and initialized in exactly the same way for each process in the pool, this all works fine. Now that I have my code working, I'd like to make it so that the file name, param_1, and param_2 are input by the user at run-time, rather than being hard-coded.

Note that X, param_1, and param_2 are not modified while the work is being done. Since I don't modify them, I could do something like this at the beginning of the program:

import sys
X = parse_numpy_array(sys.argv[1])
param_1 = float(sys.argv[2])
param_2 = float(sys.argv[3])

And that would do the trick, but since most users of this code are running the code from Windows machines, I'd rather not go the route of command-line arguments.

What I would really like to do is something like this:

X, param_1, param_2 = None, None, None

def init(x, p1, p2):
  X = x
  param_1 = p1
  param_2 = p2

if __name__=='__main__':
  filename = raw_input("Filename> ")
  param_1 = float(raw_input("Parameter 1: "))
  param_2 = float(raw_input("Parameter 2: "))
  X = parse_numpy_array(filename)
  pool = Pool(initializer = init, initargs = (X, param_1, param_2,))
  arglist = linspace(0.0,1.0,100)
  results = pool.map(do_work, arglist)
  #save results in a .npy file for analysis
  save("Results", [X,results])

But, of course, this fails: X, param_1, and param_2 are all None when the pool.map call happens. I'm pretty new to multiprocessing, so I'm not sure why the call to the initializer fails. Is there a way to do what I want to do? Is there a better way to go about this altogether? I've also looked at using shared data, but from my understanding of the documentation, that only works with ctypes objects, which don't include numpy arrays. Any help with this would be greatly appreciated.

rnorris
  • According to [this](http://coding.derkeiler.com/Archive/Python/comp.lang.python/2008-09/msg00937.html) Numpy can be made to play nicely with ctypes. – Ken Aug 15 '12 at 02:45
  • Instead of looking at the documentation, you should have looked at [Stack](http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing) [Overflow](http://stackoverflow.com/a/5036766/577088) :) – senderle Aug 15 '12 at 02:56
  • @senderle I'm not sure you should ever encourage people to not look at the documentation. I agree that searching SO is usually more helpful, though. – Ken Aug 15 '12 at 03:00
  • @Ken, yeah, I was joking. Point taken, though. – senderle Aug 15 '12 at 03:05
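As the comments above suggest, a NumPy array can in fact live in shared memory by building a view over a `multiprocessing.Array` buffer. A minimal sketch of that technique (the function names here are illustrative, not from the question's code):

```python
import numpy as np
from multiprocessing import Array, Process

def fill(shared, n):
    # Re-wrap the shared ctypes buffer as a NumPy view; writes through
    # this view are visible to every process holding the Array.
    view = np.frombuffer(shared.get_obj())  # 'd' maps to float64
    view[:n] = np.arange(n)

def shared_demo(n=5):
    shared = Array('d', n)  # n doubles in shared memory, zero-initialized
    p = Process(target=fill, args=(shared, n))
    p.start()
    p.join()
    return np.frombuffer(shared.get_obj()).tolist()

if __name__ == '__main__':
    print(shared_demo())  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

The same view trick works with a `Pool` worker, with the caveat that the `Array`'s lock must be used (or disabled with `lock=False`) if workers write to overlapping regions.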

2 Answers


I had a similar problem. If you just want to read my solution, skip ahead a few lines :) I had to:

  • share a numpy.array between threads operating on different part of it and...
  • pass Pool.map a function with more than one argument.

I noticed that:

  • the data of the numpy.array was correctly read but...
  • changes to the numpy.array were not made permanent
  • Pool.map had problems handling lambda functions, or so it appeared to me (if this point is not clear to you, just ignore it)

My solution was to:

  • make the target function's only argument a list
  • make the target function return the modified data instead of directly trying to write on the numpy.array

I understand that your do_work function already returns the computed data, so you would just have to modify do_work to accept a list (containing X, param_1, param_2, and arg) as its argument, and to pack the input to the target function in this format before passing it to Pool.map.

Here is a sample implementation:

def do_work2(args):
    X,param_1,param_2,arg = args
    return heavy_computation(X, param_1, param_2, arg)

Now you have to pack the input to the do_work2 function before calling it. Your main becomes:

if __name__=='__main__':
   filename = raw_input("Filename> ")
   param_1 = float(raw_input("Parameter 1: "))
   param_2 = float(raw_input("Parameter 2: "))
   X = parse_numpy_array(filename)
   # now you pack the input arguments
   arglist = [[X, param_1, param_2, n] for n in linspace(0.0,1.0,100)]
   # note: each element of arglist is pickled when it is handed to a worker,
   # so X is serialized for every task; for a very large X this can be costly
   pool = Pool()
   results = pool.map(do_work2, arglist)
   #save results in a .npy file for analysis
   save("Results", [X,results])
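A self-contained sketch of the packing pattern this answer describes, with `heavy_computation` replaced by stand-in arithmetic so the example runs on its own:

```python
from multiprocessing import Pool

def do_work2(args):
    # Unpack the single list argument that pool.map passes in.
    x, p1, p2, arg = args
    return x * p1 + p2 * arg  # stand-in for heavy_computation

def pack_demo():
    # Pack the constant inputs alongside each varying argument.
    arglist = [[10.0, 2.0, 3.0, n] for n in (0.0, 1.0, 2.0)]
    pool = Pool(2)
    try:
        return pool.map(do_work2, arglist)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(pack_demo())  # [20.0, 23.0, 26.0]
```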
erasing
  • 1
    Everything imported from `multiprocessing` (not from `threading`) uses `pickle` to pass arguments to functins. As `labmda` functions cannot be pickled , `Pool.map` cannot use it as an argument passed to function. That is why `Pool.map had problems handling lambda functions` – xolodec Mar 23 '14 at 07:02
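The pickling limitation the comment describes is easy to verify directly; a small sketch:

```python
import pickle

# multiprocessing ships the target function to workers via pickle,
# and pickle serializes functions by importable name. A lambda has
# no such name, so pickling it fails.
try:
    pickle.dumps(lambda x: x * x)
    could_pickle = True
except (pickle.PicklingError, AttributeError):
    could_pickle = False

print(could_pickle)  # False
```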

To make your last idea work, I think you can simply make X, param_1, and param_2 global variables by using the global keyword before modifying them inside the if statement. So add the following:

global X
global param_1
global param_2

directly after the `if __name__ == '__main__':` line.

Ken
  • 1
    I don't think this does anything. The `if` statement is in the global namespace, so `X`, `param_1`, and `param_2` are already global. In any case, globalness isn't the problem here; this is a `multiprocessing`-specific problem. – senderle Aug 15 '12 at 03:08
  • Too bad that isn't the problem. I don't do much with `multiprocessing` since my problems are almost never embarrassingly parallel. The variables inside the `if` are, however, not in the global namespace by my intuition and then experiments. – Ken Aug 15 '12 at 03:23
  • I'm not sure what experiments you did, but if you run this script: `if __name__ == '__main__': a = 5; print globals()['a']`, Python prints '5'. So I'm pretty sure `a` is a in the global namespace. – senderle Aug 15 '12 at 04:07
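For the record, the initializer route from the question does work once `init` declares the names `global` before assigning them; without that, the assignments only create locals inside `init`, leaving the module-level names as `None` in every worker, which is exactly the failure described. A minimal runnable sketch, with `heavy_computation` replaced by stand-in arithmetic:

```python
from multiprocessing import Pool

import numpy as np

# Module-level names, populated in each worker by the initializer.
X = None
param_1 = None
param_2 = None

def init(x, p1, p2):
    # Without `global`, these assignments would only bind locals
    # inside init(), and the module-level names would stay None.
    global X, param_1, param_2
    X = x
    param_1 = p1
    param_2 = p2

def do_work(arg):
    # Stand-in for heavy_computation(X, param_1, param_2, arg).
    return X.sum() * param_1 + param_2 * arg

def init_demo():
    data = np.ones(4)
    pool = Pool(2, initializer=init, initargs=(data, 2.0, 3.0))
    try:
        return pool.map(do_work, [0.0, 1.0])
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(init_demo())  # [8.0, 11.0]
```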