I have around 4000 data points and a program that processes them. Because of the large number of points the program is very slow, even though I've applied some vectorization with numpy.arange in the nested loops.

I searched for pool.map; the problem is that it takes only one argument. I see there are some existing answers to this problem here: Python multiprocessing pool.map for multiple arguments. I used the last one, which calls the map method with a list of arguments. I have around 4 args, so I put them in a list and passed it to map along with the function name. In the function, I extract each argument from the list and perform the operation, but it doesn't work. This is the code where I call map:

if __name__ == '__main__':
    pool = Pool(processes=8)
    p = pool.map(kriging1D, [x, v, a, n])
    plt.scatter(x, v, color='red')
    plt.plot(range(5227), p, color='blue')

This is the function to be parallelized:

def kriging1D(args):
    x = args[0]
    v = args[1]
    a = args[2]
    n = args[3]
    # perform some operations on the args...
    ...
    # return the result...

But I get this error:

plt.plot(range(5227), p, color='blue')
NameError: name 'p' is not defined

Note: before adding this line,

if __name__ == '__main__':

I got this error:

RuntimeError:
    Attempt to start a new process before the current process
    has finished its bootstrapping phase.

    This probably means that you are on Windows and you have
    forgotten to use the proper idiom in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce a Windows executable.

That's why I've added the if statement.

For more clarity: v and x are vectors, each of a large size such as 4000 (both have the same length). My intent is to parallelize the processing of each (x[i], v[i]) pair, so that multiple x and v elements are processed at a time instead of one by one.

Can anyone please tell me what mistake I'm making? Or suggest an alternative method?

Thank you.

Dania
2 Answers

The operation of map() -- and, I assume, consequently of pool.map(), though I haven't used it myself -- is as follows.

Calling map(myfunc, [1, 2, 3]) calls myfunc on each of the arguments 1, 2, 3 in turn: first myfunc(1), then myfunc(2), and so on.

So pool.map(kriging1D, [x,v,a,n]) is equivalent to calling kriging1D(x), then kriging1D(v), and so on, no? From your method body, it looks like that is not what you want to do. Are you sure you really want to be using pool.map and not pool.apply instead?
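For illustration, here's the pattern with the builtin map (pool.map makes the same sequence of calls, just spread across worker processes):

>>> def myfunc(arg):
...     return arg * 10
...
>>> list(map(myfunc, [1, 2, 3]))  # calls myfunc(1), myfunc(2), myfunc(3)
[10, 20, 30]

So on the first call kriging1D would receive your entire x vector as args, on the second your entire v vector, and so on -- only four calls in total, none with the argument list you intended.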

I apologise if I have misunderstood your question; this isn't my area of expertise but I thought I'd try and help since there are no answers yet.

Sam
  • Thanks a lot, I'm really new to multiprocessing in Python. According to what you've explained regarding map, this is indeed not what I want to do. As I said in the question, I want to parallelize the processing of the indices of the vectors, but I'm really not sure how to do it. I've read that pool.apply is no longer used; can you please tell me what technique to use to achieve my goal? Thank you. – Dania Jul 01 '15 at 08:32
  • @Dania I don't think I can help; I don't have a very good understanding of multiprocessing or of what you're trying to do. – Sam Jul 01 '15 at 08:39

The syntax you are using is appropriate for apply, which is a single invocation, not batch parallel.

>>> from pathos.multiprocessing import ProcessPool as Pool
>>> p = Pool()
>>> 
>>> def do_it(x,y,z):
...   return x+y*z
... 
>>> p.apply(do_it, [2,3,4])
14

If you want to use batch parallel, you'd need to give a list of the same length for each parameter. Here, I'm running a 3-argument function in 5-way parallel -- note the length 5 lists.

>>> p.map(do_it, range(5),range(5,10),range(0,10,2))
[0, 13, 30, 51, 76]
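Applied to the question, that means rewriting kriging1D to take the four values positionally and broadcasting the scalar parameters to the length of the data. A sketch, assuming a and n are shared by every point:

>>> def kriging1D(x, v, a, n):
...   # process one (x, v) point with shared parameters a and n
...   ...
... 
>>> results = p.map(kriging1D, list(x), list(v), [a]*len(x), [n]*len(x))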

If you want to use this syntax, you need the multiprocessing fork called pathos (or there is also parmap) -- both are also found in the SO answer you linked in the question.

If you want to use the stdlib multiprocessing, then you should look at the other answers in the same question.
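For example, on Python 3.3+ the stdlib Pool has a starmap method that takes a sequence of argument tuples, so a minimal sketch of the same per-point idea (again assuming a four-argument kriging1D, with a and n shared across points) would be:

from itertools import repeat
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        # one (x[i], v[i], a, n) tuple per data point; repeat() broadcasts
        # the shared scalars, and zip stops at the shortest iterable
        p = pool.starmap(kriging1D, zip(x, v, repeat(a), repeat(n)))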

Hopefully, the above will clarify those answers, however.

Mike McKerns