
Despite the warnings and the confusion I got from the many questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation, updated a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.
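
To make this concrete, here is a stripped-down sketch of the pattern I'm describing (the image-processing function and the array shapes are placeholders, not my real code):

import multiprocessing

import numpy as np

# Global accumulator, updated from the callback (which runs in the main process).
result = np.zeros((1000, 1000), dtype=np.float32)

def process_image(path):
    # Placeholder for the real read-image-do-stuff-return work.
    return np.ones((1000, 1000), dtype=np.float32)

def accumulate(partial):
    # Callbacks are invoked one at a time in the main process.
    global result
    result += partial

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)
    for path in ['image_%d.png' % i for i in range(100)]:
        pool.apply_async(process_image, (path,), callback=accumulate)
    pool.close()
    pool.join()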

Now, I probably didn't get 8x because of the lock required by each callback call, but what I got is encouraging.

I'm trying to find out if this can be improved upon, or if this is a good result. Questions:

  • I suppose the returned NumPy arrays got pickled?
  • Were the underlying NumPy buffers copied or just passed by reference?
  • How can I find out what the bottleneck is? Any particularly useful technique?
  • Can I improve on that or is such an improvement pretty common in such cases?
F.X.
  • 6,809
  • 3
  • 49
  • 71
  • It's going to be pretty hard to give any sort of robust answer to most of these questions... – mgilson Apr 10 '13 at 16:30
  • @mgilson: Well, it does work pretty well as it is. I just wanted to know a bit more about what it does under the hood. I'm especially unsure about exactly _which_ objects get pickled, copied or anything else. The Python profiler does not say anything about the subprocesses, unfortunately, so I wasn't able to clearly determine the overhead of using `multiprocessing` that way! Any idea is welcome ;) – F.X. Apr 10 '13 at 19:48

2 Answers


I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically, it suppresses the pickling that normally happens when passing NumPy arrays around. All you have to do is, instead of:

import numpy as np
foo = np.empty(1000000)

do this:

import sharedmem
foo = sharedmem.empty(1000000)

and off you go passing foo from one process to another, like:

q = multiprocessing.Queue()
...
q.put(foo)
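
For example, a rough end-to-end sketch could look like the following (this is just my illustration of the idea, with a toy array and a second queue for the result, not code from the module's docs):

import multiprocessing

import sharedmem

def worker(q_in, q_out):
    # The shared array arrives without its buffer being copied,
    # so even very large images are cheap to pass around.
    foo = q_in.get()
    q_out.put(float(foo.sum()))

if __name__ == '__main__':
    foo = sharedmem.empty(1000000)  # shared buffer instead of np.empty
    foo[:] = 1.0

    q_in, q_out = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q_in, q_out))
    p.start()
    q_in.put(foo)
    print(q_out.get())  # 1000000.0
    p.join()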

Note however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.

Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock).

Velimir Mlaker
  • 10,664
  • 4
  • 46
  • 58
  • Ah, I must confess I forgot about this question, sorry about that! Since it was a bit too big for a comment, I added an answer about how I ended up doing this, it might be interesting to you or others. Not sure which one I have to accept though...? – F.X. May 21 '13 at 19:06

Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.

How I did it

It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, and it breaks down like this (there's a stripped-down sketch after the list):

  • First, on Unix systems, any object you instantiate before spawning a process will be present in the context of the child process, and it's only copied when it actually needs to be (i.e. when it's written to). This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.

  • Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.

  • Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).

  • Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
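
Put together, a stripped-down version of that structure looks something like this (function names, shapes and paths are placeholders, not my actual code):

import multiprocessing

import numpy as np

def load_and_process(path):
    # Placeholder for the real read-image-do-stuff (map) step.
    return np.zeros((5000, 5000, 4), dtype=np.float32)

def worker(paths, queue):
    # Each worker loads its own images from disk and keeps a running
    # reduction; only the final partial result goes through the queue.
    partial = np.zeros((5000, 5000, 4), dtype=np.float32)
    for path in paths:
        partial += load_and_process(path)
    queue.put(partial)

if __name__ == '__main__':
    paths = ['image_%d.tif' % i for i in range(60)]
    n_workers = 6
    chunks = [paths[i::n_workers] for i in range(n_workers)]

    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(chunk, queue))
               for chunk in chunks]
    for p in workers:
        p.start()

    # Final reduction in the main process: drain the queue before joining.
    total = sum(queue.get() for _ in range(n_workers))
    for p in workers:
        p.join()
    print(total.dtype, total.shape)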

I'm not seeing any performance loss when transferring my images through a queue, even when they're eating up a few hundred megabytes. I ran this on a 6-core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than not using multiprocessing at all.

(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)

Profiling

Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:

  • Compared to the time built-in (at least in my shell), the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU... I run it like this to get everything I can:

     $ /usr/bin/time -v python test.py
    
  • Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.

    I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on and off; you never know when you might need it. (There's a minimal sketch of this after the list.)

  • Last but not least, you can get some more or less useful information from a syscall profile (mostly to see if the OS isn't taking ages transferring heaps of data between the processes) by attaching to one of your running Python processes, like so:

     $ sudo strace -c -p <python-process-id>
    
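
And for reference, the per-worker cProfile hook mentioned above boils down to something like this (DEBUG_PROFILE, do_work and the output filename are just examples):

import cProfile
import multiprocessing
import os

DEBUG_PROFILE = True  # flip off for normal runs

def do_work(paths, queue):
    # Stand-in for the real per-worker map/reduce body.
    queue.put(len(paths))

def worker(paths, queue):
    profiler = cProfile.Profile() if DEBUG_PROFILE else None
    if profiler:
        profiler.enable()
    do_work(paths, queue)
    if profiler:
        profiler.disable()
        # One profile file per process; inspect it later with pstats.
        profiler.dump_stats('worker-%d.prof' % os.getpid())

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(['a.png', 'b.png'], q))
    p.start()
    print(q.get())
    p.join()
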
F.X.
  • 6,809
  • 3
  • 49
  • 71
  • So it sounds like your program is disk bound. And since you still got a 6x speedup, the CPU's cores must have parallel access to the disk. The fact that you **only** got 6x instead of 12x must mean that the hardware thread pairs (on each core) share a single disk access resource. – Velimir Mlaker May 21 '13 at 21:30
  • 1
    And, apparently on an X5550 processor, even if your program was memory bound, you'd still only get 6x, according to this: http://superuser.com/a/279803 – Velimir Mlaker May 21 '13 at 21:32
  • I'm using a pretty fast SSD and the disk reads are not maxed out, but I suspected something like what you said. 6 cores are more than enough for me, so I don't mind, but it's interesting to note that the 12 cores people boast about can only do the work of half that under some conditions ;) – F.X. May 21 '13 at 22:38