
I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example where parallelizing leads to longer runtimes, but I don't understand why. My runtime on 1 CPU was 51 seconds vs. 217 seconds on 2 CPUs.

My assumption was that running the loop in parallel would copy lists `a` and `b` to each processor, dispatch item_n to one CPU and item_n+1 to the other, execute the function, write the results back to a list (in order), and then grab the next two items, and so on. I'm obviously missing something.

Is this a poor example or use of joblib? Did I simply structure the code wrong?

Here is the example:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed

## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))

b = zip(np.random.rand(300,2),np.random.rand(300,2))

## Check if one line segment contains another. 
def check_paths(path, paths):
    for other_path in paths:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

res = Parallel(n_jobs=2)(delayed(check_paths)(Path(points), a) for points in b)
mhabiger
  • Did you run your 1-process test with the same code (only `n_jobs=1`) or did you use a simple for-loop? –  Jan 09 '14 at 18:12
  • Also, there is a big warning on the `joblib` site about protecting the main code with `if __name__ == '__main__':`. –  Jan 09 '14 at 18:23
  • Just try with the thread back-end... – user2284570 May 30 '14 at 15:04
  • I have exactly the same problem. I am running everything from inside `if __name__ == '__main__':`, and I'm actually just using their example problem: `output = Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(int(1e5)))`. If I run it with n_jobs=1, it takes 5 seconds. With n_jobs=2 through n_jobs=4 (it is a 4-core machine), it takes 42 seconds!? – David Doria Jul 31 '14 at 13:43
  • @DavidDoria many computations may not benefit from parallel computation because of the time it takes to map the computation to other processors and then return the results. I get roughly the same results as you do with the joblib example. I've found the biggest gains when comparing items in a list to a list of items and needing to return the closest match (e.g. geocoding or fuzzy address matching). – mhabiger Aug 01 '14 at 16:22
  • My parallel run appeared to be slower than a single CPU until I found out the problem was my timer code. To see the function running, add the `verbose=50` argument; this will output the time elapsed and job details, e.g. `Parallel(n_jobs=4, verbose=50)`. – Will Jan 12 '15 at 17:06

2 Answers


In short: I cannot reproduce your problem. If you are on Windows, you should protect your main loop with an `if __name__ == '__main__':` guard (see the documentation of joblib.Parallel). The only problem I see is a lot of data-copying overhead, but your numbers seem too extreme to be caused by that.
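
A minimal sketch of that guard (the `square` function here is only a placeholder for illustration):

from joblib import Parallel, delayed

def square(x):
    return x ** 2

if __name__ == '__main__':
    # Without the guard, on Windows each worker process re-imports this
    # module and would try to spawn workers of its own.
    results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))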

In detail, on my i7 3770k (4 cores, 8 threads) I get the following timings with your code for different values of n_jobs:

For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec

So there is a gain from using multiple processes. However, although I have four cores, the gain already saturates at three processes, so I suspect the execution time is actually limited by memory access rather than processor time.

You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy `a` once for each element in `b`, which is inefficient. So instead, access the global `a` (Parallel will fork the process, copying all global variables to the newly spawned processes, so `a` is accessible). This gives me the following code, with timing and the main-loop guard that the joblib documentation recommends:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    for other_path in a:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

Timing results:

 n_jobs=1: Finished in 34.2845709324 sec
 n_jobs=2: Finished in 16.6254048347 sec
 n_jobs=3: Finished in 11.219119072 sec
 n_jobs=4: Finished in 8.61683392525 sec
 n_jobs=5: Finished in 8.51907801628 sec
 n_jobs=6: Finished in 8.21842098236 sec
 n_jobs=7: Finished in 8.21816396713 sec
 n_jobs=8: Finished in 7.81841087341 sec

The saturation point has now moved to n_jobs=4, which is the value to be expected on a four-core machine.
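
As an aside, if you don't want to hard-code the core count, joblib exposes a `cpu_count` helper, and `n_jobs=-1` requests all available cores:

from joblib import cpu_count

# Reports how many CPUs joblib can use; passing n_jobs=-1 to Parallel
# requests all of them.
print(cpu_count())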

check_paths does several redundant calculations that can easily be eliminated. First, for every element of `a` the line `Path(...)` is executed on each call; precalculate those Path objects once. Second, the string `res='no cross'` is assigned on every loop iteration, although it can change at most once (followed by a break and return); move that line in front of the loop. Then the code looks like this:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    # `a` is the global list of precomputed Path objects; the forked
    # worker processes inherit it, so it is not copied per dispatched item.
    res='no cross'
    for other_path in a:
        if other_path.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    a = [Path(x) for x in a]

    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

with timings:

n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec

A side note on your code, although I haven't really followed its purpose as it is unrelated to your question: contains_path will only return True if this path completely contains the given path (see the documentation). Therefore your function will basically always return 'no cross' given the random input.
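
A small sketch to illustrate the distinction (the coordinates are made up for the demo):

from matplotlib.path import Path

outer = Path([(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)])  # a closed square
inner = Path([(1, 1), (3, 1), (2, 3)])                  # lies fully inside
crossing = Path([(1, 1), (6, 1)])                       # pokes out of the square

print(outer.contains_path(inner))     # True: completely contained
print(outer.contains_path(crossing))  # False: intersecting is not containment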

  • Thank you very much for providing such a thorough answer. This really helped improve my understanding of parallel processing. The problem was indeed not protecting the main loop. I am using OS X and had just assumed that only Windows was affected by this. – mhabiger Jan 09 '14 at 20:38
  • Very helpful answer from 3 years ago now, many thanks to Nabla. – DanielSon Feb 09 '17 at 23:51
  • Through experimentation I realized that the `batch_size` parameter can significantly impact the thread optimization. – muon Aug 24 '17 at 02:39

In addition to the above answer, and for future reference, there are two aspects to this question, and joblib's recent evolution helps with both.

Parallel pool creation overhead: the problem here is that creating a parallel pool is costly. It was especially costly here, as the code not protected by the `if __name__ == '__main__':` guard was run in each job at the creation of the Parallel object. In the latest joblib (still in beta), Parallel can be used as a context manager to limit the number of times a pool is created, and thus the impact of this overhead.
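
A sketch of that pattern, reusing the `sqrt` toy example from the comments above (exact availability depends on your joblib version):

from math import sqrt
from joblib import Parallel, delayed

if __name__ == '__main__':
    # The same worker pool is reused for both calls instead of being
    # created and torn down twice.
    with Parallel(n_jobs=2) as parallel:
        roots = parallel(delayed(sqrt)(i ** 2) for i in range(1000))
        more = parallel(delayed(sqrt)(i ** 4) for i in range(1000))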

Dispatching overhead: it is important to keep in mind that dispatching an item of the for loop has an overhead (much bigger than iterating a plain for loop without Parallel). Thus, if the individual computation items are very fast, this overhead will dominate the computation. In the latest joblib, joblib traces the execution time of each job and starts batching them together if they are very fast. This strongly limits the impact of the dispatch overhead in most cases (see the PR for benchmarks and discussion).
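
As a commenter on the first answer notes, the related `batch_size` argument can also be set by hand; a sketch, assuming a joblib version where `batch_size='auto'` is supported:

from math import sqrt
from joblib import Parallel, delayed

if __name__ == '__main__':
    # Each dispatch carries a fixed cost, so very cheap tasks are better
    # shipped to the workers in batches; 'auto' adapts the batch size to
    # the measured duration of completed tasks.
    out = Parallel(n_jobs=2, batch_size='auto')(
        delayed(sqrt)(i ** 2) for i in range(int(1e5)))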


Disclaimer: I am the original author of joblib (just saying, to warn against a potential conflict of interest in my answer, although here I think it is irrelevant).

Gael Varoquaux
  • Actually, the Parallel infrastructure of joblib is currently getting a revamp, and the tradeoffs will soon change. Agreed that the docs and examples should be improved. – Gael Varoquaux Aug 28 '17 at 14:27
  • Great to hear, indeed. **A rigorous self-documentation of the performance impacts**, coming from various real-world use cases and setups, **would** be both a serious performance self-benchmark and **of immense educational value.** – user3666197 Aug 28 '17 at 15:03
  • It's really good to hear your opinion on documentation; it is mine too. But it takes more than people agreeing that good documentation is important for it to happen. As I now lead a team of quite a few developers and researchers, I am no longer able to do this work myself, so we need to find people to write such documentation, and it is a more challenging task than it may seem. – Gael Varoquaux Aug 28 '17 at 15:07
  • No need to convince me of this fact. I was also impressed by the sad tiny footnote [No.1], linked to the sentence "**Yet an essential problem remains computation time**", in the INRIA Parietal AMPHI ad, [Pg.2] >>> https://team.inria.fr/parietal/files/2017/07/amphi_post_doc.pdf. That's warning and ringing all the bells even more. – user3666197 Aug 28 '17 at 15:17
  • A few remarks on more realistic **`joblib`** performance testing and module performance self-documentation were sent by email for your inspiration and further decisions. Looking forward to the announced new `joblib` version arriving in public, and all the best to your research teams. – user3666197 Aug 28 '17 at 16:01