4

Suppose I have the following function:

def f(x,y):
    return x*y

How do I apply the function to each element in an NxM 2D numpy array using the multiprocessing module? Using serial iteration, the code might look as follows:

import numpy as np
N = 10
M = 12
results = np.zeros(shape=(N,M))
for x in range(N):
    for y in range(M):
        results[x,y] = f(x,y)
PTTHomps
  • I assume that this is a toy model and what you need to do is more complicated, but numpy has efficient functions to do what you wrote in your code – Julien Spronck Apr 24 '15 at 21:13
  • I'm going through the documentation at the moment, but haven't found how to apply a function to each element in an array yet. Any guidance? – PTTHomps Apr 24 '15 at 21:14
  • see answer below for this simple case – Julien Spronck Apr 24 '15 at 21:17
  • @JulienSpronck looks like you forgot to post the answer you mention – aensm Apr 24 '15 at 21:28
  • Julien is right - multiprocessing is the *last* tool to reach for when you are trying to optimize numpy code. Multiprocessing in Python is slow and cumbersome, and there are usually much bigger efficiency gains to be had using [broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) and BLAS-optimized linear algebra operations. Beyond that, there is [Cython](http://cython.org/), [numba](http://numba.pydata.org/) and [numexpr](https://code.google.com/p/numexpr/). Since you haven't shown us the code you're actually trying to optimize, it's hard to give more specific advice. – ali_m Apr 24 '15 at 21:28
  • @aensm I posted it but then OP told me that multiprocessing was needed, posting it again – Julien Spronck Apr 24 '15 at 21:30
  • my real world application runs a machine learning routine on large amounts of data. The purpose for running it against a 2D numpy array is to allow me to identify desirable values for gamma and C in an SVM with the RBF kernel. Besides that, this is a useful example to help understand the multiprocessing module better. – PTTHomps Apr 24 '15 at 22:10
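
As a hedged aside on the SVM use case in the last comment (the real code isn't shown here): if the slow job is a grid search over gamma and C, scikit-learn's GridSearchCV can parallelize that search directly via n_jobs, with no hand-rolled multiprocessing. The data and grid values below are placeholders, purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in data; replace with the real training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# illustrative grid of C and gamma values only
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1, 1]}

# n_jobs=-1 runs the cross-validated grid search on all available cores
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)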

2 Answers

8

Here's how you might parallelize your example function using multiprocessing. I've also included an almost identical pure Python function that uses non-parallel for loops, and a numpy one-liner that achieves the same result:

import numpy as np
from multiprocessing import Pool


def f(x,y):
    return x * y

# this helper function is needed because map() can only be used for functions
# that take a single argument (see http://stackoverflow.com/q/5442910/1461210)
def splat_f(args):
    return f(*args)

# a pool of 8 worker processes (on platforms that use the "spawn" start method,
# e.g. Windows, create the pool under an `if __name__ == '__main__':` guard)
pool = Pool(8)

def parallel(M, N):
    results = pool.map(splat_f, ((i, j) for i in range(M) for j in range(N)))
    return np.array(results).reshape(M, N)

def nonparallel(M, N):
    out = np.zeros((M, N), dtype=int)
    for i in range(M):
        for j in range(N):
            out[i, j] = f(i, j)
    return out

def broadcast(M, N):
    # np.ogrid gives an (M, 1) column and a (1, N) row; broadcasting their
    # product fills the whole M x N grid at once
    return np.multiply(*np.ogrid[:M, :N])

Now let's look at the performance:

%timeit parallel(1000, 1000)
# 1 loops, best of 3: 1.67 s per loop

%timeit nonparallel(1000, 1000)
# 1 loops, best of 3: 395 ms per loop

%timeit broadcast(1000, 1000)
# 100 loops, best of 3: 2 ms per loop

The non-parallel pure Python version beats the parallelized version by a factor of about 4, and the version using numpy array broadcasting absolutely crushes the other two.

The problem is that starting and stopping Python worker processes, and shipping arguments and results between them, carries quite a lot of overhead, and your test function is so trivial that each worker process spends only a tiny proportion of its lifetime doing useful work. Multiprocessing only makes sense if each process has a substantial amount of work to do before it is killed. You might, for example, give each worker a bigger chunk of the output array to compute (try messing around with the chunksize= parameter to pool.map()), but with such a trivial example I doubt you'll see a big improvement.
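
As a rough sketch of that idea (the helper name row_f is introduced here for illustration, and no timing claims are made): hand each worker a whole row of the grid rather than a single (i, j) pair, and use chunksize= to batch the tasks further. This reuses the pool and f defined above:

def row_f(args):
    # one task now computes an entire row: [f(i, 0), f(i, 1), ..., f(i, N-1)]
    i, N = args
    return [f(i, j) for j in range(N)]

def parallel_rows(M, N, chunksize=50):
    # chunksize= hands tasks to each worker in larger batches, amortising
    # the pickling/IPC overhead over many calls to f
    rows = pool.map(row_f, ((i, N) for i in range(M)), chunksize=chunksize)
    return np.array(rows)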

I don't know what your actual code looks like - maybe your function is big and expensive enough to warrant using multiprocessing. However, I would bet that there are much better ways to improve its performance.

ali_m
  • I'm sure there are performance gains to be had through other means, but currently one run takes about 10 minutes, and only uses a single logical core. It maxes out the core, but only makes use of one. Thanks for the answer :) – PTTHomps Apr 24 '15 at 22:34
  • @TraxusIV Believe me, I know the temptation of wanting to see all my cores working at full tilt, but trust me - that's the wrong way to approach optimization! Always start out by profiling your code (e.g. using `line_profiler`) and identify where the bottlenecks are, then focus on those. Use BLAS and broadcasting wherever you can, and use Cython or numba for any nasty inner `for` loops that you can't get rid of with broadcasting. – ali_m Apr 24 '15 at 22:39
  • As a side note: instead of using `splat_f` and `pool.map` you can simplify to `pool.starmap` that does internally the same thing as `splat_f`. – sophros Aug 03 '17 at 17:03
  • @sophros Yes, provided you're using Python 3.3 or newer (see http://stackoverflow.com/q/5442910/1461210, which I linked to above) – ali_m Aug 03 '17 at 17:19
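
A minimal sketch of the starmap variant mentioned in the comment above (Python 3.3+), reusing the pool and f defined in the answer; it makes the splat_f wrapper unnecessary:

def parallel_starmap(M, N):
    # starmap unpacks each (i, j) tuple into a call f(i, j) inside the workers
    results = pool.starmap(f, ((i, j) for i in range(M) for j in range(N)))
    return np.array(results).reshape(M, N)
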
0

Not sure multiprocessing is needed in your case. In the simple example above, you can do

import numpy
# indexing='ij' makes result[x, y] == x * y, shaped (N, M) like the loop in the question
X, Y = numpy.meshgrid(numpy.arange(10), numpy.arange(12), indexing='ij')
result = X * Y
Julien Spronck