
Despite the warnings and the confusion I got from the many questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation, updated a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.
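
To make this concrete, here is a stripped-down sketch of the pattern I'm describing (the image-processing function and the array shapes are placeholders, not my real code):

import multiprocessing

import numpy as np

# Global accumulator, updated from the callback (which runs in the main process).
result = np.zeros((1000, 1000), dtype=np.float32)

def process_image(path):
    # Placeholder for the real read-image-do-stuff-return work.
    return np.ones((1000, 1000), dtype=np.float32)

def accumulate(partial):
    # Callbacks are invoked one at a time in the main process.
    global result
    result += partial

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)
    for path in ['image_%d.png' % i for i in range(100)]:
        pool.apply_async(process_image, (path,), callback=accumulate)
    pool.close()
    pool.join()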

Now, I probably didn't get 8x because of the lock required by each callback call, but what I got is encouraging.

I'm trying to find out if this can be improved upon, or if this is a good result. Questions:

  • I suppose the returned NumPy arrays got pickled?
  • Were the underlying NumPy buffers copied or just passed by reference?
  • How can I find out what the bottleneck is? Any particularly useful technique?
  • Can I improve on that or is such an improvement pretty common in such cases?
F.X.
  • 6,809
  • 3
  • 49
  • 71
  • It's going to be pretty hard to give any sort of robust answer to most of these questions... – mgilson Apr 10 '13 at 16:30
  • @mgilson: Well, it does work pretty well as it is. I just wanted to know a bit more about what it does under the hood. I'm especially unsure about exactly _which_ objects get pickled, copied or anything else. The Python profiler does not say anything about the subprocesses, unfortunately, so I wasn't able to clearly determine the overhead of using `multiprocessing` that way! Any idea is welcome ;) – F.X. Apr 10 '13 at 19:48

2 Answers


I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically, it suppresses the pickling that normally happens when passing NumPy arrays around. All you have to do is, instead of:

import numpy as np
foo = np.empty(1000000)

do this:

import sharedmem
foo = sharedmem.empty(1000000)

and off you go passing foo from one process to another, like:

q = multiprocessing.Queue()
...
q.put(foo)
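
For example, a rough end-to-end sketch could look like the following (this is just my illustration of the idea, with a toy array and a second queue for the result, not code from the module's docs):

import multiprocessing

import sharedmem

def worker(q_in, q_out):
    # The shared array arrives without its buffer being copied,
    # so even very large images are cheap to pass around.
    foo = q_in.get()
    q_out.put(float(foo.sum()))

if __name__ == '__main__':
    foo = sharedmem.empty(1000000)  # shared buffer instead of np.empty
    foo[:] = 1.0

    q_in, q_out = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q_in, q_out))
    p.start()
    q_in.put(foo)
    print(q_out.get())  # 1000000.0
    p.join()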

Note however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.

Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock).

Velimir Mlaker
  • 10,664
  • 4
  • 46
  • 58
  • Ah, I must confess I forgot about this question, sorry about that! Since it was a bit too big for a comment, I added an answer about how I ended up doing this, it might be interesting to you or others. Not sure which one I have to accept though...? – F.X. May 21 '13 at 19:06

Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.

How I did it

It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, and it breaks down like this (there's a stripped-down sketch after the list):

  • First, on Unix systems, any object you instantiate before spawning a process will be present in the context of the child process, and it's only copied when it actually needs to be (i.e. when it's written to). This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.

  • Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.

  • Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).

  • Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
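
Put together, a stripped-down version of that structure looks something like this (function names, shapes and paths are placeholders, not my actual code):

import multiprocessing

import numpy as np

def load_and_process(path):
    # Placeholder for the real read-image-do-stuff (map) step.
    return np.zeros((5000, 5000, 4), dtype=np.float32)

def worker(paths, queue):
    # Each worker loads its own images from disk and keeps a running
    # reduction; only the final partial result goes through the queue.
    partial = np.zeros((5000, 5000, 4), dtype=np.float32)
    for path in paths:
        partial += load_and_process(path)
    queue.put(partial)

if __name__ == '__main__':
    paths = ['image_%d.tif' % i for i in range(60)]
    n_workers = 6
    chunks = [paths[i::n_workers] for i in range(n_workers)]

    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(chunk, queue))
               for chunk in chunks]
    for p in workers:
        p.start()

    # Final reduction in the main process: drain the queue before joining.
    total = sum(queue.get() for _ in range(n_workers))
    for p in workers:
        p.join()
    print(total.dtype, total.shape)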

I'm not seeing any performance loss when transferring my images through a queue, even when they're eating up a few hundred megabytes. I ran this on a 6-core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than not using multiprocessing at all.

(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)

Profiling

Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:

  • Compared to the time built-in (at least in my shell), the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU... I run it like this to get everything I can:

     $ /usr/bin/time -v python test.py
    
  • Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.

    I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on and off; you never know when you might need it. (There's a minimal sketch of this after the list.)

  • Last but not least, you can get some more or less useful information from a syscall profile (mostly to see if the OS isn't taking ages transferring heaps of data between the processes) by attaching to one of your running Python processes, like so:

     $ sudo strace -c -p <python-process-id>
    
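
And for reference, the per-worker cProfile hook mentioned above boils down to something like this (DEBUG_PROFILE, do_work and the output filename are just examples):

import cProfile
import multiprocessing
import os

DEBUG_PROFILE = True  # flip off for normal runs

def do_work(paths, queue):
    # Stand-in for the real per-worker map/reduce body.
    queue.put(len(paths))

def worker(paths, queue):
    profiler = cProfile.Profile() if DEBUG_PROFILE else None
    if profiler:
        profiler.enable()
    do_work(paths, queue)
    if profiler:
        profiler.disable()
        # One profile file per process; inspect it later with pstats.
        profiler.dump_stats('worker-%d.prof' % os.getpid())

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(['a.png', 'b.png'], q))
    p.start()
    print(q.get())
    p.join()
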
F.X.
  • 6,809
  • 3
  • 49
  • 71
  • So it sounds like your program is disk bound. And since you still got a 6x speedup, the CPU's cores must have parallel access to the disk. The fact that you **only** got 6x instead of 12x must mean that the hardware thread pairs (on each core) share a single disk access resource. – Velimir Mlaker May 21 '13 at 21:30
  • 1
    And, apparently on an X5550 processor, even if your program was memory bound, you'd still only get 6x, according to this: http://superuser.com/a/279803 – Velimir Mlaker May 21 '13 at 21:32
  • I'm using a pretty fast SSD and the disk reads are not maxed out, but I suspected something like what you said. 6 cores are more than enough for me, so I don't mind, but it's interesting to note that the 12 cores people boast about can only do the work of half that under some conditions ;) – F.X. May 21 '13 at 22:38