
I want to speed up the computation of u ** 2 where u is a numpy array, using the multiprocessing module.

Here is my attempt (file name multi.py):

# to run on Windows/IPython: import multi, then run -m multi

from multiprocessing import Pool
import numpy as np

if __name__ == '__main__':
    u = np.arange(6e7)

    def test(N):
        pool = Pool(N)
        v = len(u) // N
        tasks = [u[k * v:(k + 1) * v] for k in range(N)]
        res = pool.map_async(np.square, tasks).get()
        return res

Here are the benchmarks:

In [25]: %time  r1=test(1)
Wall time: 13.2 s

In [26]: %time  r2=test(2)
Wall time: 7.75 s

In [27]: %time  r4=test(4)
Wall time: 8.29 s

In [31]: %time r=u**2
Wall time: 512 ms

I have 2 physical cores on my PC, so test(2) running faster than test(1) is encouraging.

But for the moment, plain numpy is faster: multiprocessing adds a big overhead.

So my question is: how can u ** 2 be sped up with multiprocessing (and is it possible at all)?

EDIT

I realize that each process does its work in its own memory space, so a lot of copying necessarily occurs (see here for example). So there is no hope of speeding up a simple computation this way.


2 Answers


Multiprocessing in CPython is intrinsically costly because of the Global Interpreter Lock (GIL), which prevents multiple native threads from executing Python bytecode simultaneously. multiprocessing works around this limitation by spawning a separate Python interpreter for every worker process and using pickling to send arguments and return values to and from the workers. Unfortunately this entails a lot of unavoidable overhead.

If you absolutely must use multiprocessing, it's advisable to do as much work as possible with each process in order to minimize the relative amount of time spent spawning and killing processes. For example, if you're processing chunks of a larger array in parallel, make the chunks as large as possible and do as many processing steps as you can in one go, rather than looping over your array multiple times.

In general, though, you will be much better off doing your multithreading in a lower-level language that isn't limited by the GIL. For simple numerical expressions such as your example, numexpr is a very simple way to achieve a significant performance boost (~4x on an i7 CPU with 4 cores and hyperthreading). Besides implementing parallel processing in C++, a more significant benefit is that it avoids allocating memory for intermediate results and thus makes more efficient use of the cache.

In [1]: import numexpr as ne

In [2]: u = np.arange(6e7)

In [3]: %%timeit u = np.arange(6e7)
   .....: u**2
   .....: 
1 loop, best of 3: 528 ms per loop

In [4]: %%timeit u = np.arange(6e7)
ne.evaluate("u**2")
   .....: 
10 loops, best of 3: 127 ms per loop

Other options suited to more complicated tasks include Cython and numba.

Finally, I should also mention that there are other Python implementations besides CPython that lack a GIL, for example PyPy, Jython and IronPython. However, these all suffer from their own limitations. To my knowledge, none of them offer proper support for numpy, scipy or matplotlib.

  • thanks for this complete answer. Curiously, ne.evaluate("u**2") is slower than u**2 on my computer with floats. It wins with ints. – B. M. Feb 21 '16 at 21:49

Answering my own question:

From scipy-cookbook, an IMHO underrated feature:

while numpy is doing an array operation, python also releases the GIL.

So multithreading is not a problem for numpy operations.

from threading import Thread
import numpy as np

u = np.arange(6 * 10**7)

def multi(N):
    n = u.size // N
    # square each chunk in place; np.ndarray.__ipow__ releases the GIL
    threads = [Thread(target=np.ndarray.__ipow__,
                      args=(u[k * n:(k + 1) * n], 2)) for k in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()

with a nearly 2x gain on a 2-core processor:

In [7]: %timeit multi(1)
10 loops, best of 3: 172 ms per loop

In [8]: %timeit multi(4)
10 loops, best of 3: 92.7 ms per loop
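For reference, the same idea can be written with concurrent.futures instead of raw Thread objects (my variant, not from the original answer), writing each squared chunk into a preallocated output array rather than relying on the in-place __ipow__ trick:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

u = np.arange(6 * 10**6, dtype=np.float64)

def square_threaded(a, N):
    # np.square releases the GIL, so the N threads really run in parallel;
    # a.size is assumed divisible by N here for simplicity
    n = a.size // N
    out = np.empty_like(a)

    def worker(k):
        # write the squared chunk directly into the matching slice of out
        np.square(a[k * n:(k + 1) * n], out=out[k * n:(k + 1) * n])

    with ThreadPoolExecutor(N) as ex:
        list(ex.map(worker, range(N)))
    return out

r = square_threaded(u, 4)
# r matches u ** 2, with the chunks squared by 4 threads in parallel
```

Unlike the __ipow__ version, this leaves u intact, at the cost of allocating a second array.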