Parfor for Python

Question

I am looking for a definitive equivalent of MATLAB's parfor for Python (SciPy, NumPy).

Is there a solution similar to parfor? If not, what makes creating one difficult?

UPDATE: Here is a typical numerical computation that I need to speed up:

import numpy as np

N = 2000
output = np.zeros([N,N])
for i in range(N):
    for j in range(N):
        output[i,j] = HeavyComputationThatIsThreadSafe(i,j)

An example of a heavy computation function is:

import scipy.optimize

def HeavyComputationThatIsThreadSafe(i,j):
    n = i * j

    return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]

JudoWill · Answer 1 · 2011-01-17T17:38:33.413

The one built into Python is multiprocessing (docs are here). I always use multiprocessing.Pool with as many workers as processors. Then, whenever I need a for-loop-like structure, I use Pool.imap.

As long as the body of your function does not depend on any previous iteration, you should get a near-linear speed-up. This also requires that your inputs and outputs are picklable, but that is easy to ensure for standard types.
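To make the pickling requirement concrete, here is a minimal sketch using only the standard library. The helper name `is_picklable` is made up for this illustration; it simply tries `pickle.dumps`, which is the same serialization step that multiprocessing relies on when it ships arguments and results between processes.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Plain data such as numbers, tuples and lists pickles without trouble.
task_args = (3, [1.0, 2.0])
print(is_picklable(task_args))

# A lambda has no importable name, so pickling it fails. This is why
# worker functions are normally defined with def at module level.
print(is_picklable(lambda x: x * x))
```

In practice this means the function handed to the pool should be a named, module-level function, and its arguments and return values should be ordinary data.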

UPDATE: Some code for your updated function just to show how easy it is:

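import numpy as np
from multiprocessing import Pool
from itertools import product

def Fun(ij):
    # imap passes a single argument, so unpack the (i, j) pair here
    # (Fun is assumed to wrap the function from the question)
    return HeavyComputationThatIsThreadSafe(*ij)

output = np.zeros((N,N))
pool = Pool() #defaults to number of available CPU's
chunksize = 20 #this may take some guessing ... take a look at the docs to decide
# chunksize is an argument of imap, not of enumerate
for ind, res in enumerate(pool.imap(Fun, product(range(N), range(N)), chunksize)):
    output.flat[ind] = res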
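Why a flat index works here: product over two ranges yields the (i, j) pairs in row-major order, which is the same order in which `output.flat` walks a two-dimensional array. The sketch below checks that correspondence with plain Python for a tiny N; `flat_to_pair` and `index_of_pair` are names invented for this illustration.

```python
from itertools import product

N = 3  # tiny size, for illustration only

# product(range(N), range(N)) yields (0, 0), (0, 1), (0, 2), (1, 0), ...
# Record which flat position each (i, j) pair arrives at.
index_of_pair = {}
for ind, (i, j) in enumerate(product(range(N), range(N))):
    index_of_pair[(i, j)] = ind


def flat_to_pair(ind, n):
    """Recover (i, j) from a row-major flat index (hypothetical helper)."""
    return divmod(ind, n)


# Every pair sits at flat index i * N + j.
for (i, j), ind in index_of_pair.items():
    assert ind == i * N + j
    assert flat_to_pair(ind, N) == (i, j)
```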

You should replace `output[ind]` by `output.flat[ind]` to make the code work. (`output` is a two-dimensional array and would need two indices.) — Sven Marnach, Jan 17 '11 at 06:57
@Sven: Thanks ... that comes from switching between matlab and python all the time. — JudoWill, Jan 17 '11 at 17:38

Sven Marnach · Accepted Answer · score 20 · 2015-05-06T16:31:05.307

There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.

edited May 06 '15 at 16:31

answered Jan 13 '11 at 16:41

Sven Marnach


+1 Didn't know about client.MultiEngineClient even though I do use IPython. Thanks for the steer! – David Heffernan Jan 13 '11 at 16:46
It is not apparent to me whether I can run code sped up with the IPython parallel computing framework in script mode, i.e. not running through ipython. – Dat Chu Jan 13 '11 at 17:17
@Dat Chu: Of course you can. Just write the commands you would type at the prompt in a file and run it with Python. (Is this what you are asking for?) – Sven Marnach Jan 13 '11 at 18:05
Up-to-date link to [the documentation on quick and easy parallelism](http://ipython.org/ipython-doc/stable/parallel/parallel_multiengine.html#quick-and-easy-parallelism). – tsh Aug 30 '11 at 13:41
Sven, I think you mean parfor where you write matfor. – A. Donda May 06 '15 at 15:05
Actually, that is where the pain comes from. There are too many of them, and none of them suits every purpose. You need to try out for yourself which works and which doesn't, and which is faster in which case. Sometimes an implementation will even be slower than no parallelism at all. – River May 05 '20 at 16:26
@River Yes, optimising code is tedious, and parallelism is hard. I suggest you start with multiprocessing, which is part of the standard library. – Sven Marnach May 05 '20 at 17:51

score 11 · Answer 3 · edited Jun 20 '20 at 09:12

Jupyter Notebook

As an example, suppose you want to write the equivalent of this MATLAB code in Python:

matlabpool open 4
parfor n=0:9
   for i=1:10000
       for j=1:10000
           s=j*i   
       end
   end
   n
end
disp('done')

Here is one way to write this in Python, particularly in a Jupyter notebook. You have to put the function in a file in the working directory (I called it FunForParFor.py) with the following content:

def func(n):
    for i in range(10000):
        for j in range(10000):
            s=j*i
    print(n)

Then I go to my Jupyter notebook and write the following code

import multiprocessing  
import FunForParFor

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()   
    print('done')

This has worked for me! I just wanted to share it here to give you a particular example.

Ion Stoica · Answer 4 · 2019-02-04T21:14:30.537

This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.

To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.

import numpy as np
import time

import ray

ray.init()

# Define the function. Each remote function will be executed 
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i*j
    time.sleep(0.5) # Simulate some heavy computation. 
    return n

N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e, an identifier to the 
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous finished, which
        # leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))

# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)

# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
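The estimate in the comments above can be checked with a little arithmetic. This is a rough sketch that ignores scheduling overhead and assumes tasks run in waves of one task per core; `estimated_seconds` is a hypothetical helper written for this illustration, not part of Ray.

```python
import math


def estimated_seconds(num_tasks, seconds_per_task, num_workers):
    """Rough wall-clock estimate when tasks run num_workers at a time."""
    waves = math.ceil(num_tasks / num_workers)
    return waves * seconds_per_task


N = 10
for workers in (1, 2, 4, 8):
    # N * N tasks of 0.5 s each, as in the example above
    print(workers, estimated_seconds(N * N, 0.5, workers))
```

When the number of tasks is not a multiple of the number of workers, the last wave is only partly full, so the estimate comes out slightly above N*N*0.5s/p.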

There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.

Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, so the remote function's computation should take longer than the overhead of invoking it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize its scheduling and startup overhead.
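One common way to reach that minimum task size is to batch many small computations into a single call. The sketch below shows the batching step in plain Python, without Ray; `chunked`, `cheap_task` and `process_batch` are hypothetical names, and the batch size of 25 is an arbitrary choice for illustration.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[k:k + size] for k in range(0, len(items), size)]


def cheap_task(pair):
    """Stands in for a computation too small to schedule on its own."""
    i, j = pair
    return i * j


def process_batch(batch):
    """Run many cheap tasks inside one call, so the per-call overhead is paid once."""
    return [cheap_task(pair) for pair in batch]


N = 10
pairs = [(i, j) for i in range(N) for j in range(N)]
batches = chunked(pairs, 25)  # 4 calls instead of 100

# Each process_batch call is the unit you would hand to a remote
# function or pool worker; here the batches simply run one after another.
results = [value for batch in batches for value in process_batch(batch)]
```

The flattened `results` list keeps the original order, so it can still be reshaped into an N x N array afterwards.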

score 4 · Answer 5 · answered Jan 13 '11 at 16:36

I've always used Parallel Python, but it's not a complete analog, since I believe it typically uses separate processes, which can be expensive on certain operating systems. Still, if the bodies of your loops are chunky enough, this won't matter and can actually have some benefits.

David Heffernan

Separate processes is also the default behavior of Matlab's `parfor`. [This page](https://www.mathworks.com/help/parallel-computing/choose-between-thread-based-and-process-based-environments.html) explains how to get threads instead, but warns that functionality is limited. [This page](https://www.mathworks.com/help/parallel-computing/examples/scale-up-from-desktop-to-cluster.html) mentions that `local` process-based parallelism is the default. – japreiss Jun 14 '20 at 20:20

score 4 · Answer 6 · answered Sep 13 '18 at 08:39

I tried all the solutions here, but found that the simplest way, and the closest equivalent to MATLAB's parfor, is numba's prange.

Essentially you add a single letter to your loop, turning range into prange, and compile the function with numba's parallel=True option:

from numba import njit, prange

@njit(parallel=True)
def parallel_sum(A):
    total = 0.0
    for i in prange(A.shape[0]):
        total += A[i]

    return total

Felix

this only speeds up if the computation is entirely supported by numba, see docs for [list](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html) – Nimrod Morag Nov 18 '19 at 13:16

score 1 · Answer 7 · answered Mar 04 '20 at 01:38

I recommend trying joblib Parallel.

one liner

from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))

instructional

Instead of writing a for loop

from time import sleep
for _ in range(10):
   sleep(.2)

rewrite your operation into a list comprehension

[sleep(.2) for _ in range(10)]

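Now let us not evaluate the expression directly, but collect what should be done. This is what the delayed function is for. Note that delayed wraps the function itself and the arguments go to the wrapper, so nothing runs yet.

from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]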
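The idea of collecting calls without running them can be sketched with the standard library alone. This is a toy illustration, not joblib's implementation: `toy_delayed` and `run_all` are hypothetical names, and a thread pool stands in for joblib's workers.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_delayed(func):
    """Capture a call as (func, args, kwargs) instead of running it."""
    def capture(*args, **kwargs):
        return (func, args, kwargs)
    return capture


def run_all(tasks, n_workers=2):
    """Execute captured calls on a thread pool; results keep task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        futures = [executor.submit(f, *a, **kw) for f, a, kw in tasks]
        return [future.result() for future in futures]


calls = []


def record(x):
    """Example work function that notes when it actually runs."""
    calls.append(x)
    return x * 2


# Building the task list does not execute record at all ...
tasks = [toy_delayed(record)(k) for k in range(5)]
nothing_ran_yet = (calls == [])

# ... the work only happens once the tasks are handed to the executor.
results = run_all(tasks)
```

Writing `toy_delayed(record(k))` instead would call `record` immediately, which is exactly the mistake the pattern is meant to avoid.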

Next, instantiate Parallel with n_jobs workers and process the list.

from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))

[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.8s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    1.4s finished

score 1 · Answer 8 · answered Sep 02 '20 at 18:17

Ok, I'll also give it a go, let's see if my way is easier

from multiprocessing import Pool
def heavy_func(key):
    #do some heavy computation on each key 
    output = key**2
    return key, output 

output_data ={}     #<--this dict will store the results
keys = [1,5,7,8,10] #<--compute heavy_func over all the values of keys
with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]

Now output_data is a dictionary that will contain for every key the result of the computation on this key.

That is it..

I like this as it is very useful for practical use. It would be really helpful maybe to replace `i` with something more explicit, and if you explained why the use of `imap_unordered()` instead of `imap()` or `map()`. — eric, May 26 '22 at 14:42
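On the question of ordering raised in the comment above: map() and imap() hand results back in input order, while imap_unordered() yields each result as soon as it is ready. The sketch below illustrates the same distinction with concurrent.futures and threads, which is a stand-in chosen for this illustration rather than the multiprocessing.Pool API itself; `executor.map` plays the role of the ordered variants and `as_completed` the role of the unordered one. The sleep durations are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def heavy_func(key):
    """Return the key together with its result, as in the answer above."""
    time.sleep(0.05 * (4 - key))  # larger keys finish sooner in this example
    return key, key ** 2


keys = [1, 2, 3]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Ordered: results come back in input order, whichever task finishes first.
    ordered = list(executor.map(heavy_func, keys))

    # Unordered: each result is handled as soon as its task completes.
    futures = [executor.submit(heavy_func, key) for key in keys]
    unordered = {}
    for future in as_completed(futures):
        key, value = future.result()
        unordered[key] = value
```

Because the function returns its key along with the result, the arrival order does not matter when filling a dictionary, so the unordered variant can start storing results without waiting for slower tasks.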
