
This should be my third and final question regarding my attempts to increase the performance of some statistical analysis I am doing with Python. I have two versions of my code (single core vs. multiprocessing), and I was expecting to gain performance by using multiple cores, since the code uncompresses/unpacks quite a few binary strings. Unfortunately, I noticed that performance actually decreased when using multiple cores.

Does anyone have a possible explanation for what I am observing? (Scroll down to the April 16th update for more information.)

The key part of the program is the function `numpy_array` (plus `decode` in the multiprocessing version); a code snippet is shown below (the full code is accessible via the pastebin links further down):

def numpy_array(data, peaks):
    rt_counter=0
    for x in peaks:
        if rt_counter %(len(peaks)/20) == 0:
            update_progress()
        peak_counter=0
        data_buff=base64.b64decode(x)
        buff_size=len(data_buff)/4
        unpack_format=">%dL" % buff_size
        index=0
        for y in struct.unpack(unpack_format,data_buff):
            buff1=struct.pack("I",y)
            buff2=struct.unpack("f",buff1)[0]
            if (index % 2 == 0):
                data[rt_counter][1][peak_counter][0]=float(buff2)
            else:
                data[rt_counter][1][peak_counter][1]=float(buff2)
                peak_counter+=1
            index+=1
        rt_counter+=1

The multiprocessing version performs this with a set of functions; I will show the two key ones below:

def tonumpyarray(mp_arr):
    return np.frombuffer(mp_arr.get_obj())

def numpy_array(shared_arr,peaks):
    processors=mp.cpu_count()
    with contextlib.closing(mp.Pool(processes=processors,
                                    initializer=pool_init,
                                    initargs=(shared_arr, ))) as pool:
        chunk_size=int(len(peaks)/processors)
        map_parameters=[]
        for i in range(processors):
            counter = i*chunk_size
            chunk=peaks[i*chunk_size:(i+1)*chunk_size]
            map_parameters.append((chunk, counter))
        pool.map(decode,map_parameters)

def decode ((chunk, counter)):
    data=tonumpyarray(shared_arr).view(
        [('f0','<f4'), ('f1','<f4',(250000,2))])
    for x in chunk:
        peak_counter=0
        data_buff=base64.b64decode(x)
        buff_size=len(data_buff)/4
        unpack_format=">%dL" % buff_size
        index=0
        for y in struct.unpack(unpack_format,data_buff):
            buff1=struct.pack("I",y)
            buff2=struct.unpack("f",buff1)[0]
            #with shared_arr.get_lock():
            if (index % 2 == 0):
                data[counter][1][peak_counter][0]=float(buff2)
            else:
                data[counter][1][peak_counter][1]=float(buff2)
                peak_counter+=1
            index+=1
        counter+=1

The full program code can be accessed via these pastebin links:

Pastebin for single core version

Pastebin for multiprocessing version

The performance that I am observing with a file containing 239 timepoints and ~180k measurement pairs per timepoint is ~2.5 minutes for the single-core version and ~3.5 minutes for the multiprocessing version.

PS: The two previous questions (from my first ever attempts at parallelization):

  1. Python multi-processing
  2. Making my NumPy array shared across processes

-- April 16th --

I have been profiling my program with the cProfile library (calling `cProfile.run('main()')` in `__main__`), which shows that there is one step slowing everything down:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
23   85.859    3.733   85.859    3.733 {method 'acquire' of 'thread.lock' objects}

The thing that I do not understand here is that thread.lock objects are used in threading (to my understanding) but should not show up in multiprocessing, as each core should run a single thread (besides having its own locking mechanism), so how is it that this occurs, and why does a single call take 3.7 seconds?
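
For reference, a minimal sketch of how the same profile could be dumped to a file and queried with pstats to see which callers account for the acquire time (the 'profile.out' file name is just a placeholder):

import cProfile
import pstats

# Write the profile to a file instead of printing it immediately
cProfile.run('main()', 'profile.out')

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time
stats.print_callers('acquire')                  # which functions are waiting on the lock?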

Bas Jansen
  • Can you share the links to your previous questions in this question? And paste the functions you think are important to the question itself. – 0x90 Apr 15 '13 at 15:32
  • Of course, let me edit the question. – Bas Jansen Apr 15 '13 at 15:33
  • It very well could have something to do with the GIL. Check out this presentation: http://www.youtube.com/watch?v=Obt-vMVdM8s – alexhb Apr 15 '13 at 15:54
  • I am watching that presentation right now – Bas Jansen Apr 15 '13 at 15:59
  • Have you already profiled the amount of time actually taken by the conversions and by the inter-process communication? My bet is that what you are seeing is simply the overhead of `multiprocessing`. BTW: do *not* use `def f((a,b)):` syntax, since it is ugly and works only in Python 2. Either unpack the sequence when calling or inside the function. – Bakuriu Apr 15 '13 at 16:09
  • I haven't yet, my current comparison is based on running the whole program through `time` (not the Python time). – Bas Jansen Apr 15 '13 at 16:12
  • @BasJansen This question may be better suited for Code Review, since what you really want is to improve the performance of some already working code. I can already see some improvements that could speed up the computation (e.g. your `search_left`/`right` can be replaced by [native numpy calls](http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html#numpy.searchsorted) etc...) – Bakuriu Apr 15 '13 at 16:17
  • my `search_left/right` are based on `bisect_left/right` from `NumPy`, the only change is to allow the use of a different shaped array (instead of transforming my array to use the `NumPy` version). – Bas Jansen Apr 15 '13 at 16:19
  • @Bakuriu I really just want to know why my multiprocessing version performs worse than the single core instead of just having someone say do this instead of that. – Bas Jansen Apr 15 '13 at 16:20
  • If you are interested in **why** the code is slow with multiprocessing, then your question is a duplicate of [this](http://stackoverflow.com/questions/13121790/using-multiprocessing-manager-list-instead-of-a-real-list-makes-the-calculation) question. If you want to optimize the code, then the question is off-topic. – Bakuriu Apr 15 '13 at 16:21
  • @Bakuriu I use 2.7 and shared arrays while the question that you linked is 3.3, doesn't use shared arrays and uses a manager – Bas Jansen Apr 15 '13 at 16:23
  • @BasJansen The next time I'll first have a look at my crystal ball to divine which version you are running, even though it's completely irrelevant to your question. Also, the question I linked *does* have an *answer* that would explain the slowdown. If you don't even want to search/read carefully, I don't see why anybody would want to spend their time investigating your code. – Bakuriu Apr 15 '13 at 16:33
  • @Bakuriu How should a question about lists and a manager pop up when searching for performance based on a shared array? Secondly, I have read your answer on that question, but what you suggest (using `multiprocessing.Array`) is something that is already being done. – Bas Jansen Apr 15 '13 at 16:35
  • @BasJansen Yes, I saw that. But did you take into account that *any* access to `mp.Array` is about *20 times slower* than a normal array access? I think this alone could explain the extra minute for the parallelized version. To verify this you *ought to* profile the code using the profile module, etc. – Bakuriu Apr 15 '13 at 16:43
  • @Bakuriu I did not know that; it could be a good reason why it's slower than I was expecting. I am still busy implementing the profiling library (although I am at home now), aimed at isolating the biggest loss in performance. – Bas Jansen Apr 15 '13 at 18:18
  • @Bakuriu: where do you get "20 times slower"? Could you provide an example? For a numpy array created with `np.frombuffer()`, the shared array is just a blob of memory as any other. – jfs Apr 22 '13 at 16:16
  • @BasJansen: check that `numpy` doesn't set CPU affinity on your system (http://bugs.python.org/issue17038), i.e., that the program uses several CPU cores at once. – jfs Apr 22 '13 at 16:18
  • @J.F.Sebastian The "20 times slower" was what I benchmarked in the other question, even though it was a comparison between a `list` access and an `mp.Array` access. If you know how to make `numpy` access as fast as a normal access while sharing the memory, I think you should provide an answer. – Bakuriu Apr 22 '13 at 16:22
  • @Bakuriu: do you mean: item access using Python loops such as: `for i in range(n): a[i] += 1` is slower for both numpy arrays and `mp.Array` than for a Python list (for numpy arrays it should be written as `a += 1` instead (vectorized operation)). But it is unrelated to OP's question: why the `multiprocessing` version is slower. The part that uses mp.Array to create numpy arrays from a shared memory is already based on [my answer](http://stackoverflow.com/a/7908612/4279). – jfs Apr 22 '13 at 17:00
  • @J.F.Sebastian Not exactly. I mean that (for example) `a[0]` is much faster for a plain Python `list` than for an `mp.Array` instance (I never spoke of `numpy.ndarray`...), at least that's what I observed with some benchmarking. – Bakuriu Apr 22 '13 at 17:03
  • @Bakuriu: I meant the same thing (`a[i]` performance depending on the type of `a`). The OP uses `mp.Array` only through a `numpy` array, therefore `a[i]` performance where `a` is a numpy array is also relevant. Anyway, [using vectorized operations instead of `a[i]` access can give us a 100-times speedup](http://stackoverflow.com/a/15273683/4279), but it is unrelated to the OP's question. – jfs Apr 24 '13 at 10:41
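
To make the vectorization point above concrete, here is a minimal sketch of what the inner pack/unpack loop amounts to, assuming the base64 payload is a flat sequence of big-endian 32-bit floats stored as alternating pairs (which is what the `struct.unpack('>L')` / `struct.pack('I')` / `struct.unpack('f')` round trip does on a little-endian machine); `decode_peak` is an illustrative name, not part of the original code:

import base64

import numpy as np

def decode_peak(encoded):
    # Assumption: the payload holds big-endian float32 values in
    # alternating (column 0, column 1) order, matching the loop above.
    raw = base64.b64decode(encoded)
    values = np.frombuffer(raw, dtype='>f4')  # big-endian 32-bit floats
    return values.reshape(-1, 2)              # even indices -> column 0, odd -> column 1

Each decoded array could then be assigned in one step, e.g. data[rt_counter][1][:len(pairs)] = pairs, instead of element by element.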

3 Answers


Shared data is a well-known cause of slowdowns due to synchronization.

Can you split your data among processes, or give each process an independent copy? Then your processes would not need to synchronize anything up until the moment when all calculations are done.

Then I'd let the master process join the output of all worker processes into one coherent set.

The approach may take extra RAM, but RAM is cheap nowadays.

If you ask me, I'm also puzzled by 3700 ms per thread-lock acquisition. On the other hand, profiling may be misleading about special calls like this.
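
To make the split-and-copy idea above concrete, a rough sketch rather than a drop-in replacement (`decode_chunk` and `decode_all` are illustrative names): each worker decodes a private chunk and returns plain numpy arrays, and only the parent joins the results, so no shared array or lock is involved:

import base64
import struct
import multiprocessing as mp

import numpy as np

def decode_chunk(chunk):
    # Each worker receives its own copy of its chunk; no shared state.
    out = []
    for encoded in chunk:
        raw = base64.b64decode(encoded)
        count = len(raw) // 4
        ints = struct.unpack(">%dL" % count, raw)
        floats = [struct.unpack("f", struct.pack("I", v))[0] for v in ints]
        out.append(np.array(floats, dtype=np.float32).reshape(-1, 2))  # pairs of floats
    return out

def decode_all(peaks, processes=None):
    processes = processes or mp.cpu_count()
    chunk_size = max(1, len(peaks) // processes)
    chunks = [peaks[i:i + chunk_size] for i in range(0, len(peaks), chunk_size)]
    pool = mp.Pool(processes=processes)
    try:
        results = pool.map(decode_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    # The parent stitches the per-worker outputs back into one ordered list.
    return [arr for part in results for arr in part]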

9000
  • I could possibly split the data array into n copies, but I'd still like to understand why and how the entire process takes so long. Would you recommend a different way of profiling? – Bas Jansen Apr 19 '13 at 07:29

Your pastebins are empty.

The problem is that multiprocessing uses fork if it is available (instead of spawning a new Python process). Forked processes share the same environment (file descriptors, for example). Maybe there are some locks shared among them.

Here is some frustration about that: Multiprocessing or os.fork, os.exec?
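
If you want to rule fork out, a minimal sketch of forcing the spawn start method (note this needs Python 3.4+, so it does not apply directly to a 2.7 setup; `worker` is just a placeholder for the real decode work):

import multiprocessing as mp

def worker(chunk):
    return len(chunk)  # placeholder for the real work

if __name__ == '__main__':
    # 'spawn' starts a fresh interpreter for each worker instead of forking,
    # so nothing from the parent (file descriptors, locks, ...) is inherited.
    ctx = mp.get_context('spawn')
    pool = ctx.Pool(processes=ctx.cpu_count())
    try:
        print(pool.map(worker, [[1, 2], [3, 4]]))
    finally:
        pool.close()
        pool.join()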

enomad

As far as the last part of your question goes, the Python docs basically say that multiprocessing.Lock is a clone of threading.Lock. Acquire calls on locks can take a long time because, if the lock is already held, they block until the lock is released. This can become a problem when multiple processes are competing for access to the same data, as in your code. Because I can't view your pastebin, I can only guess as to exactly what's going on, but most likely your processes are acquiring the lock for long periods of time, which stops other processes from running even when there is plenty of free CPU time. This shouldn't be affected by the GIL, as that only constrains multithreaded applications, not multiprocessed ones.

So, how to fix this? My guess is that you have some sort of lock protecting your shared array that stays locked while a process is doing intensive calculations that take a relatively long time, thereby barring access for the other processes, which subsequently block on their lock.acquire() calls. Assuming you have enough RAM, I strongly endorse the answer that suggests storing multiple copies of the array, one in each process's address space. However, just note that passing large data structures through map can cause unexpected bottlenecks, since it requires pickling and unpickling.
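
If the lock wrapping the shared array does turn out to be the culprit, one thing worth trying (a sketch only; the size here is illustrative) is to create the array without its lock, which is safe as long as every process writes to a disjoint slice, as your chunking already ensures:

import ctypes
import multiprocessing as mp

import numpy as np

N_FLOATS = 239 * 250000 * 2  # illustrative size; match your real layout

# lock=False skips the wrapping lock entirely (mp.RawArray does the same).
# Only safe because each worker writes to its own disjoint slice.
shared_arr = mp.Array(ctypes.c_float, N_FLOATS, lock=False)

def tonumpyarray(mp_arr):
    # With lock=False there is no .get_obj(); the object itself is the
    # raw ctypes array and can be wrapped directly.
    return np.frombuffer(mp_arr, dtype=np.float32)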

djpetti