
I am aware of several questions and answers on this topic, but haven't found a satisfactory answer to this particular problem:

What is the easiest way to do a simple shared-memory parallelisation of a python loop where numpy arrays are manipulated through numpy/scipy functions?

I am not looking for the most efficient way, I just want something simple to implement that doesn't require a significant rewrite when the loop is not run in parallel: something like what OpenMP provides in lower-level languages.

The best answer I've seen in this regard is this one, but it is a rather clunky way that requires one to express the loop body as a function taking a single argument, plus several lines of shared-array-converting crud; it seems to require that the parallel function is called from __main__, and it doesn't seem to work well from the interactive prompt (where I spend a lot of my time).

With all of Python's simplicity, is this really the best way to parallelise a loop? Really? This is something trivial to parallelise in OpenMP fashion.

I have painstakingly read through the opaque documentation of the multiprocessing module, only to find that it is so general that it seems suited to everything but simple loop parallelisation. I am not interested in setting up Managers, Proxies, Pipes, etc. I just have a simple, fully parallel loop with no communication between tasks. Using MPI to parallelise such a simple situation seems like overkill, not to mention that it would be memory-inefficient in this case.
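For concreteness, here is a made-up example of the kind of loop I mean (the array, the sizes and the filter call are just stand-ins for arbitrary numpy/scipy work):

import numpy as np
from scipy.ndimage import gaussian_filter

data = np.random.rand(100, 512, 512)
result = np.empty_like(data)

# embarrassingly parallel: each iteration only reads and writes its own slice
for i in range(data.shape[0]):
    result[i] = gaussian_filter(data[i], sigma=2.0)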

I haven't had time to learn about the multitude of different shared-memory parallel packages for Python, but was wondering if someone has more experience with them and can show me a simpler way. Please do not suggest serial optimisation techniques such as Cython (I already use it), or using parallel numpy/scipy functions backed by a threaded BLAS (my case is more general, and more parallel).

tiago
  • related: [OpenMP and Python](http://stackoverflow.com/q/11368486/4279). See examples in my answer. – jfs Oct 25 '12 at 23:07
  • On Linux, the code in the answer you link to works fine from the interactive prompt. Also, Cython does support openmp-based parallelization, and it is very simple to use (replace `range` by `prange` in the loop): http://docs.cython.org/src/userguide/parallelism.html – pv. Oct 25 '12 at 23:41
  • @pv, thanks for the link. It looks quite simple. But I assume prange can only be used with C functions? This brings other issues, such as using numpy/scipy array functions from inside Cython. I assume there is no easy interface for the C equivalents of those functions to be used inside Cython? – tiago Oct 26 '12 at 00:42
  • OpenMP is typically used for fine grained parallelism of tight loops. The reason you can't find anything equivalent in python is because python doesn't give good performance for tight loops. If you don't need tight loops then use the multiprocessing module. If you do then use cython as suggested. – DaveP Oct 26 '12 at 03:20
  • @tiago: you can wrap the inside of the prange loop in `with gil:` to use any Python constructs. Some Numpy functions do release the GIL during the operation, so you may get some parallelism. However, accesses to Python objects are always serialized, so the threads are unavoidably partly synchronized. This is as good as parallelism gets in Python within a single process --- you need to use multiprocessing to get more. – pv. Oct 26 '12 at 13:23

3 Answers


With Cython parallel support:

# asd.pyx
from cython.parallel cimport prange

import numpy as np

def foo():
    cdef int i, j, n

    x = np.zeros((200, 2000), float)

    n = x.shape[0]
    # prange distributes the rows across OpenMP threads; the loop body
    # needs the GIL to call back into numpy, but np.cos releases the GIL
    # while it works, so the threads still run concurrently
    for i in prange(n, nogil=True):
        with gil:
            for j in range(100):
                x[i,:] = np.cos(x[i,:])

    return x

On a 2-core machine:

$ cython asd.pyx
$ gcc -fPIC -fopenmp -shared -o asd.so asd.c -I/usr/include/python2.7
$ export OMP_NUM_THREADS=1
$ time python -c 'import asd; asd.foo()'
real    0m1.548s
user    0m1.442s
sys 0m0.061s

$ export OMP_NUM_THREADS=2
$ time python -c 'import asd; asd.foo()'
real    0m0.602s
user    0m0.826s
sys 0m0.075s

This runs fine in parallel, since np.cos (like other ufuncs) releases the GIL.

If you want to use this interactively:

# asd.pyxbdl
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_link_args=['-fopenmp'],
                     extra_compile_args=['-fopenmp'])

and (remove asd.so and asd.c first):

>>> import pyximport
>>> pyximport.install(reload_support=True)
>>> import asd
>>> q1 = asd.foo()
# Go to an editor and change asd.pyx
>>> reload(asd)
>>> q2 = asd.foo()

So yes, in some cases you can parallelize just by using threads. OpenMP is just a fancy wrapper for threading, so Cython is only needed here for the easier syntax. Without Cython, you can use the threading module --- it works similarly to multiprocessing (and probably more robustly), but you don't need to do anything special to declare arrays as shared memory.
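For illustration, a minimal plain-threading sketch of the same computation (assuming, as above, that the per-row work is a ufunc like np.cos that releases the GIL):

import threading
import numpy as np

def worker(x, rows):
    # same work as the Cython example; np.cos drops the GIL internally,
    # so the threads can actually overlap
    for i in rows:
        for j in range(100):
            x[i, :] = np.cos(x[i, :])

x = np.zeros((200, 2000), float)
nthreads = 2
# deal the rows out round-robin; x is shared between the threads, no copies
threads = [threading.Thread(target=worker,
                            args=(x, range(k, x.shape[0], nthreads)))
           for k in range(nthreads)]
for t in threads:
    t.start()
for t in threads:
    t.join()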

However, not all operations release the GIL, so YMMV for the performance.

***

And another possibly useful link scraped from other Stackoverflow answers --- another interface to multiprocessing: http://packages.python.org/joblib/parallel.html
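A minimal sketch of that interface (this is essentially the example from the joblib documentation):

from math import sqrt
from joblib import Parallel, delayed

# compute sqrt over the inputs in 2 worker processes
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
print(results)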

pv.
  • Thank you, that seems great. I will experiment with some code. Just found out that it's not straightforward to use OpenMP with Python from MacPorts, as it uses clang by default. But using gcc manually I could make your example work. – tiago Oct 26 '12 at 22:22
  • Hi pv., a quick question - would this work on Windows too? Because I did not know where to set OMP_NUM_THREADS for Windows... Any links to get me started? – Yuxiang Wang May 26 '14 at 16:27

***

Using a mapping operation (in this case multiprocessing.Pool.map()) is more or less the canonical way to parallelize a loop on a single machine, unless and until the built-in map() is ever parallelized.
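For reference, a minimal sketch of that pattern (the function and the data here are made up):

from multiprocessing import Pool
import numpy as np

def row_work(row):
    # stand-in for arbitrary per-row numpy/scipy work
    return np.cos(row).sum()

if __name__ == '__main__':              # required on Windows, see below
    data = np.random.rand(100, 1000)
    pool = Pool()                       # one worker per core by default
    results = pool.map(row_work, data)  # each row is pickled to a worker
    pool.close()
    pool.join()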

An overview of the different possibilities can be found here.

You can use OpenMP with Python (or rather Cython), but it doesn't look exactly easy.

IIRC, the requirement that multiprocessing code only be run from __main__ is a necessity because of compatibility with Windows. Since Windows lacks fork(), it starts a new Python interpreter and has to import the code in it.

Edit

Numpy can parallelize some operations like dot(), vdot() and innerproduct(), when configured with a good multithreaded BLAS library such as OpenBLAS. (See also this question.)

Since numpy array operations are mostly element-wise, it seems possible to parallelize them. But this would involve setting up either a shared memory segment for Python objects, or dividing the arrays up into pieces and feeding them to the different processes, not unlike what multiprocessing.Pool does. No matter what approach is taken, it would incur memory and processing overhead to manage all that. One would have to run extensive tests to see for which sizes of arrays this would actually be worth the effort. The outcome of those tests would probably vary considerably per hardware architecture, operating system and amount of RAM.
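A rough sketch of the chunking variant (hypothetical; the pickling of the pieces back and forth is exactly the overhead in question):

import numpy as np
from multiprocessing import Pool

def piece_cos(piece):
    # each worker receives a copy of its piece, computes, and sends it back
    return np.cos(piece)

if __name__ == '__main__':
    x = np.random.rand(1000, 1000)
    pool = Pool(4)
    pieces = np.array_split(x, 4)               # split along the first axis
    y = np.vstack(pool.map(piece_cos, pieces))  # reassemble the result
    pool.close()
    pool.join()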

Roland Smith
  • Thank you for the link for OpenMP with Cython, I didn't know about that. Sadly it doesn't seem to be the answer I was looking for. I have seen the page you mention on scipy.org, and also [this one](http://wiki.python.org/moin/ParallelProcessing). But it seems that most of the options listed require a complex rewrite of existing code. I was just looking for a simple way to parallelise numpy/scipy operations on arrays. – tiago Oct 25 '12 at 21:37
  • Fixed the scipy.org link. The euroscipy link says "temporarily unavailable", so it should come back. – Roland Smith Aug 14 '13 at 17:12

***

The .map( ) method of the mathDict( ) class in ParallelRegression does exactly what you are looking for, in two lines of code that are easy to use at an interactive prompt. It uses true multiprocessing, so the requirement that the function to be run in parallel is pickle-able is unavoidable, but it does provide an easy way to loop over a matrix in shared memory from multiple processes.

Say you have a pickle-able function:

def sum_row( matrix, row ):
    # sums a single row of the shared-memory matrix
    return( sum( matrix[row,:] ) )

Then you just need to create a mathDict( ) object representing the matrix, and use mathDict( ).map( ):

import numpy as np
from ParallelRegression import mathDictMaker  # import path assumed; see the package docs

matrix = np.array( [i for i in range( 24 )] ).reshape( (6, 4) )

RA, MD = mathDictMaker.fromMatrix( matrix, integer=True )
res = MD.map( [(i,) for i in range( 6 )], sum_row, ordered=True )

print( res )
# [6, 22, 38, 54, 70, 86]

The documentation (link above) explains how to pass a combination of positional and keyword arguments into your function, including the matrix itself at any position or as a keyword argument. This should enable you to use pretty much any function you've already written without modifying it.

RichardB