
I want to use parallel processing to make my for loops run faster.

However, I noticed that it has just made my code run slower. See the example below, where I am using joblib with a simple function on a list of random integers. Notice that the version without parallel processing runs faster than the one with it.

Any insight as to what is happening?

import random
import time

from joblib import Parallel, delayed

def f(x):
    return x**x

if __name__ == '__main__':
    s = [random.randint(0, 100) for _ in range(0, 10000)]


    # without parallel processing
    t0 = time.time()
    out1 = [f(x) for x in s]
    t1 = time.time()
    print("without parallel processing: ", t1 - t0)

    # with parallel processing
    t0 = time.time()
    out2 = Parallel(n_jobs=8, batch_size=len(s), backend="threading")(delayed(f)(x) for x in s)
    t1 = time.time()
    print("with parallel processing: ", t1 - t0)

I am getting the following output:

without parallel processing:  0.0070569515228271484
with parallel processing:     0.10714387893676758
Riley Hun
  • Parallel processing involves additional overhead because of the more complex setup. You normally don't want to parallelize tasks that take microseconds to complete. – Muposat Oct 13 '17 at 19:41
  • I also tried it on a more complicated fuzzy-matching function, and it still took a long time. – Riley Hun Oct 13 '17 at 20:01
  • Possible duplicate of [Why is the following simple parallelized code much slower than a simple loop in Python?](https://stackoverflow.com/questions/46727090/why-is-the-following-simple-parallelized-code-much-slower-than-a-simple-loop-in) – user3666197 Oct 14 '17 at 10:44
  • If you indeed want to understand **what happens** and **how to make the processing faster**, definitely read the argumentation and test results in the link posted above. [PARALLEL] scheduling does not bring speedups for free, so set yourself up for a lot of learning about performance-motivated design tricks and a lot of testing of the actual add-on costs vs. the potential performance gains. **Indeed a thrilling domain.** – user3666197 Oct 14 '17 at 10:55

1 Answer


The parameter batch_size=len(s) effectively says: dispatch all len(s) tasks as one batch. This means you create 8 threads but then hand the entire workload to a single thread, so you pay the threading overhead without getting any parallelism.
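To see the batching effect concretely, here is a minimal sketch (the list size and worker count are illustrative, not taken from the question): with batch_size=len(s) the whole list goes to one worker as a single batch, while the default batch_size='auto' lets joblib split the tasks across the threads. The results are identical either way; only the scheduling differs.

```python
from joblib import Parallel, delayed

def f(x):
    return x ** x

s = list(range(20))

# One batch containing every task: only a single thread gets any work.
single_batch = Parallel(n_jobs=4, batch_size=len(s), backend="threading")(
    delayed(f)(x) for x in s
)

# Default batch_size='auto': tasks are split into batches across the threads.
auto_batch = Parallel(n_jobs=4, backend="threading")(delayed(f)(x) for x in s)

print(single_batch == auto_batch)  # results match; only scheduling differs
```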

You also need to increase the per-task workload to get a measurable advantage. I prefer to simulate it with time.sleep delays:

def f(x):
    time.sleep(0.001)   # simulate a non-trivial per-task workload
    return x**x

out2 = Parallel(n_jobs=8,
                # batch_size=len(s),  # dropped: let joblib batch automatically
                backend="threading")(delayed(f)(x) for x in s)

without parallel processing: 11.562264442443848

with parallel processing: 1.412865400314331

Muposat
    With all due respect, Sir, adding sleep into f() creates a false illusion with respect to Amdahl's Law. The [PARALLEL] fraction (here) is so tiny that the [SERIAL] part dominates, and adding a sleep() into the [PARALLEL] section does not improve the processing performance; it skews the comparison of the composition of [SER]+[PAR]/1 vs. [SER]+[PAR]/N. The costs of the [PAR]-setup and [PAR]-terminate overheads are all the more "masked" by adding sleep(). – user3666197 Oct 14 '17 at 10:40