
I am trying to make my code faster, so I want the loops to run in parallel. My first question is: how do I implement a parallel loop in Python, and which loop should I parallelize?

Here is my code:

from joblib import Parallel, delayed
import multiprocessing as mp
import numpy as np
import time

def compR(R, M, I, l):
    # accumulate the matrix-vector products for level l
    # (ppl and N are read from the globals defined below)
    for p in range(ppl):
        for i in range(N):
            R[l][p] += np.dot(M[l][p], I[i])
    return R

level = 20
ppl = 8
x = 70
y = 10
N = 15

M = [[np.random.rand(x,y) for p in range(ppl)] for l in range(level)]
R = [[np.zeros((x,1)) for p in range(ppl)] for l in range(level)]
I = [np.random.rand(y,1) for n in range(N)]

# Serial
t0 = time.time()
for l in range(level):
    for p in range(ppl):
        for i in range(N):
            R[l][p] += np.dot(M[l][p],I[i])
t1 = time.time()
print(t1-t0)

R = [[np.zeros((x,1)) for p in range(ppl)] for l in range(level)]

num_cores = mp.cpu_count()

# Parallel            
t0 = time.time()
R = Parallel(n_jobs=num_cores)(delayed(compR)(R, M, I, l) for l in range(level))            
t1 = time.time()
print(t1-t0)

Is the implementation correct? I chose to parallelize the level loop, since normally level > N > ppl. When I run the code with larger values of level, x, and y, the performance of the parallel loop is really bad. What am I doing wrong?

Gilfoyle
    Multiprocessing has overhead starting up and transferring data to and from the processes. If the parallel work isn't significant enough, this overhead can easily take more time than just doing the work serially in one process. – Mark Tolonen Dec 14 '17 at 19:10
  • @MarkTolonen I don't get it. For a large number of `level` the parallel code is 50 times slower. How can that be, even with overhead? – Gilfoyle Dec 14 '17 at 19:14
  • Transferring all that data via interprocess communication is expensive. – Mark Tolonen Dec 14 '17 at 19:15
  • @MarkTolonen What would be a faster alternative? – Gilfoyle Dec 14 '17 at 19:16
  • @Samuel it's not just the overhead of starting up and transferring, it's that you are using shared state, which needs to be serialized (pickled) and sent across processes. Sharing state in combination with `multiprocessing` is *hard to do correctly* to begin with, and will likely involve a *lot* of overhead. – juanpa.arrivillaga Dec 14 '17 at 19:17 (see the first sketch after the comments)
  • @Samuel anyway, your algorithm is well into polynomial time; parallelization, even with no overhead, isn't going to help you a lot. You are looking at `O(ppl*level*d^3)` where `d` stands in for your dot-product, dot product being a quartic-time algorithm (although, that can be improved slightly depending on what implementation you are using, but it's still polynomial). – juanpa.arrivillaga Dec 14 '17 at 19:19
  • Running it serially is 50x faster than running in parallel, so run it serially. If you need faster, multiprocessing isn't the solution. Look into extensions like `ctypes` or SWIG. Both have options to work with `numpy` data without a lot of transfer overhead. – Mark Tolonen Dec 14 '17 at 19:20
  • It looks like your entire code would be better optimized if you use pure `numpy.ndarray`s, instead of lists of numpy arrays, then use `numpy/scipy` matrix-algebra functions, and make sure your numpy/scipy is using a good BLAS implementation backend, which will be parallelized and more finely tuned than you could hope to get with `multiprocessing`. – juanpa.arrivillaga Dec 14 '17 at 19:22 (see the second sketch after the comments)
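
To make the transfer-overhead comments above concrete, here is a minimal sketch of one way to shrink what gets pickled per job: each job receives only M[l] and I and returns only its own level's results, rather than shipping the whole R and M back and forth. The helper compR_level is hypothetical (it is not part of the original code), and this is only a sketch that reuses the setup from the question; whether it beats the serial loop still depends on how much work each level does relative to the data transferred.

def compR_level(M_l, I):
    # compute only this level's ppl result vectors; sizes are taken
    # from the arguments so the worker does not depend on globals
    R_l = [np.zeros((M_l_p.shape[0], 1)) for M_l_p in M_l]
    for p, M_l_p in enumerate(M_l):
        for I_i in I:
            R_l[p] += np.dot(M_l_p, I_i)
    return R_l

R = Parallel(n_jobs=num_cores)(delayed(compR_level)(M[l], I) for l in range(level))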
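
The last comment (pure ndarrays plus BLAS) can also be sketched, assuming the goal is exactly the triple loop from the question. Since sum_i M[l][p]·I[i] = M[l][p]·(sum_i I[i]) by linearity, the whole computation collapses to a single batched matrix product that NumPy dispatches to its BLAS backend; the names M_arr, I_arr, I_sum and R_arr are new here, not from the original code.

M_arr = np.asarray(M)        # shape (level, ppl, x, y)
I_arr = np.asarray(I)        # shape (N, y, 1)
I_sum = I_arr.sum(axis=0)    # shape (y, 1); summing the I vectors first is equivalent
R_arr = M_arr @ I_sum        # shape (level, ppl, x, 1); R_arr[l][p] matches R[l][p] from the loops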

0 Answers