
I tested the performance of `map`, `mp.dummy.Pool.map`, and `mp.Pool.map`.

import itertools
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np
# wrapper function
def wrap(args): return args[0](*args[1:])
# make data arrays
x = np.random.rand(30, 100000)
y = np.random.rand(30, 100000)
# map
%timeit -n10 map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))
# mp.dummy.Pool.map
for i in range(2, 16, 2):
    print 'Thread Pool ', i, ' : ',
    t = ThreadPool(i)
    %timeit -n10 t.map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))
    t.close()
    t.join()
# mp.Pool.map
for i in range(2, 16, 2):
    print 'Process Pool ', i, ' : ',
    p = Pool(i)
    %timeit -n10 p.map(wrap, itertools.izip(itertools.repeat(np.correlate), x, y))
    p.close()
    p.join()

The outputs

 # in this case, one CPU core usage reaches 100%
 10 loops, best of 3: 3.16 ms per loop

 # in this case, all CPU core usages reach ~80%
 Thread Pool   2  : 10 loops, best of 3: 4.03 ms per loop
 Thread Pool   4  : 10 loops, best of 3: 3.3 ms per loop
 Thread Pool   6  : 10 loops, best of 3: 3.16 ms per loop
 Thread Pool   8  : 10 loops, best of 3: 4.48 ms per loop
 Thread Pool  10  : 10 loops, best of 3: 4.19 ms per loop
 Thread Pool  12  : 10 loops, best of 3: 4.03 ms per loop
 Thread Pool  14  : 10 loops, best of 3: 4.61 ms per loop

 # in this case, all CPU core usages reach 80-100%
 Process Pool   2  : 10 loops, best of 3: 71.7 ms per loop
 Process Pool   4  : 10 loops, best of 3: 128 ms per loop
 Process Pool   6  : 10 loops, best of 3: 165 ms per loop
 Process Pool   8  : 10 loops, best of 3: 145 ms per loop
 Process Pool  10  : 10 loops, best of 3: 259 ms per loop
 Process Pool  12  : 10 loops, best of 3: 176 ms per loop
 Process Pool  14  : 10 loops, best of 3: 176 ms per loop
  • Multi-threading does not increase the speed; that seems acceptable, given the GIL.

  • Multi-processing slows things down a lot, which is surprising. I have eight 3.78 GHz CPUs, each with 4 cores.

If the shape of x and y is increased to (300, 10000), i.e. 10 times larger, similar results can be seen.

But for small arrays such as (20, 1000):

 10 loops, best of 3: 28.9 µs per loop

 Thread Pool  2  : 10 loops, best of 3: 429 µs per loop
 Thread Pool  4  : 10 loops, best of 3: 632 µs per loop
 ...

 Process Pool  2  : 10 loops, best of 3: 525 µs per loop
 Process Pool  4  : 10 loops, best of 3: 660 µs per loop
 ...
  • Multi-processing and multi-threading have similar performance.
  • The single process is much faster (presumably due to the overheads of multi-processing and multi-threading?).

Anyhow, even when executing such a simple function, it is really unexpected that multiprocessing performs this badly. How can that happen?

As suggested by @TrevorMerrifield, I modified the code to avoid passing big arrays to wrap.

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np
n = 30
m = 1000
# make data in wrap
def wrap(i):
    x = np.random.rand(m)
    y = np.random.rand(m)
    return np.correlate(x, y)
# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))
# mp.dummy.Pool.map
print '---'
print 'Thread Pool %2d : '%(4),
t = ThreadPool(4)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'
# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(4),
p = Pool(4)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()

outputs

Single process :10 loops, best of 3: 688 µs per loop
 ---
Thread Pool  4 : 10 loops, best of 3: 1.67 ms per loop
 ---
Process Pool  4 : 10 loops, best of 3: 854 µs per loop
  • No improvements.

I tried another way, passing an index to wrap to get data from the global arrays x and y.

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np
# make data arrays
n = 30
m = 10000
x = np.random.rand(n, m)
y = np.random.rand(n, m)
def wrap(i):   return np.correlate(x[i], y[i])
# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))
# mp.dummy.Pool.map
print '---'
print 'Thread Pool %2d : '%(4),
t = ThreadPool(4)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'
# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(4),
p = Pool(4)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()

outputs

Single process :10 loops, best of 3: 133 µs per loop
 ---
Thread Pool  4 : 10 loops, best of 3: 2.23 ms per loop
 ---
Process Pool  4 : 10 loops, best of 3: 10.4 ms per loop
  • That's bad.....

I tried another simple example (different wrap).

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
# make data arrays
n = 30
m = 10000
# No big arrays passed to wrap
def wrap(i):   return sum(range(i, i+m))
# map
print 'Single process :',
%timeit -n10 map(wrap, range(n))
# mp.dummy.Pool.map
print '---'
i = 4
print 'Thread Pool %2d : '%(i),
t = ThreadPool(i)
%timeit -n10 t.map(wrap, range(n))
t.close()
t.join()
print '---'
# mp.Pool.map, function must be defined before making Pool
print 'Process Pool %2d : '%(i),
p = Pool(i)
%timeit -n10 p.map(wrap, range(n))
p.close()
p.join()

The timings:

 10 loops, best of 3: 4.28 ms per loop
 ---
 Thread Pool  4 : 10 loops, best of 3: 5.8 ms per loop
 ---
 Process Pool  4 : 10 loops, best of 3: 2.06 ms per loop
  • Now multiprocessing is faster.

But if m is made 10 times larger (i.e. 100000):

 Single process :10 loops, best of 3: 48.2 ms per loop
 ---
 Thread Pool  4 : 10 loops, best of 3: 61.4 ms per loop
 ---
 Process Pool  4 : 10 loops, best of 3: 43.3 ms per loop
  • Again, no improvements.
wsdzbm
  • How does this scale on larger data sets? Having multiple processes will incur overhead because of the cost of context switching. – Mac O'Brien Jul 21 '16 at 20:52
  • @CormacO'Brien I increased the size of x and y by 10 times. The results are similar, with all timings increasing by ~10 times. The code in the Q is complete. You can test it. – wsdzbm Jul 21 '16 at 21:32
  • So? What is your question? – Akshat Mahajan Jul 21 '16 at 21:37
  • @AkshatMahajan why multiprocessing is even much slower than a single process? – wsdzbm Jul 21 '16 at 21:39
  • @Lee What OS are you using? Different OS's implement different standards for process creation and thread creation, e.g. the Linux kernel does not differentiate between threads and processes except in that the former has shared state. Windows does differentiate. See http://stackoverflow.com/a/7219760/2271269 – Akshat Mahajan Jul 21 '16 at 21:48
  • @Lee However, I don't think OS matters that much here. The important thing is noting that context switching between threads is substantially less expensive than context switching between processes, because processes employ different virtual address spaces while threads share the same. As a consequence, data must be unloaded and loaded again between processes, while threads can simply share the same data. So processes end up taking more time to run, especially if the data is substantial. See http://stackoverflow.com/a/5440165/2271269 – Akshat Mahajan Jul 21 '16 at 21:54
  • I run the code on Ubuntu. But I also test on Windows 7. I agree OS doesn't matter. – wsdzbm Jul 21 '16 at 22:09

1 Answer


You are mapping wrap onto (a, b, c), where a is a function and b and c are 100K-element vectors. All of this data is pickled when it is sent to the chosen process in the pool, then unpickled when it arrives there. This is to ensure that processes have mutually exclusive access to the data.

Your problem is that the pickling is more expensive than the correlation. As a guideline, you want to minimize the amount of information that is sent between processes and maximize the amount of work each process does, while still spreading the work across the number of cores on the system.
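
A quick way to see this (a minimal sketch in IPython, reusing the (30, 100000) arrays from the question) is to compare the cost of serializing one task's arguments with the cost of the correlation itself:

import pickle
import numpy as np
x = np.random.rand(30, 100000)
y = np.random.rand(30, 100000)
# rough cost of serializing the data for a single task (two 100K-element rows)
%timeit -n10 pickle.dumps((x[0], y[0]), protocol=2)
# cost of the actual work for a single task
%timeit -n10 np.correlate(x[0], y[0])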

How to do that depends on the actual problem you're trying to solve. By tweaking your toy example so that the vectors are a bit bigger (1 million elements) and are generated randomly inside the wrap function, I could get a 2X speedup over a single core using a process pool with 4 workers. The code looks like this:

from multiprocessing import Pool
import numpy as np

def wrap(a):
    # the data is generated inside the worker, so only a small integer
    # is pickled and sent per task
    x = np.random.rand(1000000)
    y = np.random.rand(1000000)
    return np.correlate(x, y)

p = Pool(4)
p.map(wrap, range(30))
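
To time it the same way as in the question (a short sketch assuming the wrap and p defined above, run in IPython; absolute numbers will vary by machine):

# single process vs. the 4-worker pool
%timeit -n10 map(wrap, range(30))
%timeit -n10 p.map(wrap, range(30))
p.close()
p.join()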
Trevor Merrifield
  • This is a good point, but it seems we got different results. Mine: `single process, best of 3: 799 ms per loop; Pool(4), best of 3: 639 ms per loop; ThreadPool(4), best of 3: 633 ms per loop` – wsdzbm Jul 22 '16 at 12:02
  • Yeah I'm not sure why you're only getting 1.25X performance and I'm only getting 2X. If you are looking to parallelize numpy code specifically you might want to look into https://github.com/pydata/numexpr. – Trevor Merrifield Jul 22 '16 at 16:10
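
Regarding the numexpr suggestion in the last comment: a minimal sketch of the kind of elementwise expression it evaluates with its own internal thread pool (numexpr has no correlate, so this is only an illustration, not a drop-in replacement for the code above):

import numexpr as ne
import numpy as np
x = np.random.rand(30, 100000)
y = np.random.rand(30, 100000)
# the expression is compiled once and evaluated chunk by chunk by numexpr's
# worker threads, which release the GIL while they run
z = ne.evaluate('x * y + 2 * x')
print z.shape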