
I am trying to compare sequential computation and parallel computation in Python.

This is the benchmark function.

def benchmking_f(n=0):
    # Sum the squares of 10**(6+n) integers and return the elapsed
    # wall-clock time in seconds (the sum itself is discarded).
    import time
    items = range(int(10**(6 + n)))

    def f2(x):
        return x * x

    start = time.time()
    sum_squared = 0
    for i in items:
        sum_squared += f2(i)
    return time.time() - start

This is the sequential computation:

problem_size = 2

import time
start = time.time()
tlist = []
for i in range(5):
    tlist.append(benchmking_f(problem_size))
print('for loop took {}s'.format(time.time() - start))
print('each iterate took')
print(tlist)

This took about 70s to finish the job; each iteration took [14.209498167037964, 13.92169737815857, 13.949078798294067, 13.94432258605957, 14.004642486572266]

This is the parallel approach:

problem_size = 2

import itertools
import multiprocessing
start = time.time()
pool = multiprocessing.Pool(5)  # 5 worker processes, one per copy of the benchmark
tlist = list(pool.map(benchmking_f, itertools.repeat(problem_size, 5)))
print('pool.map took {}s'.format(time.time() - start))
print('each iterate took')
print(tlist)

This took about 42.45s; each iteration took [41.17476940155029, 41.92032074928284, 41.50966739654541, 41.348535776138306, 41.06284761428833]

Question

A single piece of the whole computation (one call to benchmking_f in this case) took about 14s in the sequential run, but about 41s in the parallel run.

Why is that?

Note: I am not asking about the total time. I am asking about the time a single piece of the whole computation takes: one iteration of the for loop in the sequential version, versus one process in the parallel version, i.e. how long one call to benchmking_f takes.

  • But it didn't take longer. 70 > 42.45. – kindall Sep 03 '19 at 04:36
  • @kindall Thanks for the reminder; I've updated the question. I am not asking about the total time. I am asking about the time a single piece of the whole computation takes: one iteration in the for loop, and one process in parallel. –  Sep 03 '19 at 05:04

2 Answers


How many physical (not logical) cores do you have? You're trying to run 5 copies of the function simultaneously, the function takes 100% of one core for as long as it runs, and unless you have at least 5 physical cores they're going to fight each other tooth and nail for cycles.
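For reference, a quick way to check both counts (a minimal sketch: os.cpu_count() in the standard library only reports logical cores, while the physical count here relies on the third-party psutil package):

import os
import psutil  # third-party: pip install psutil

# Logical cores include hyper-threads; physical cores are what matter
# for CPU-bound work like benchmking_f.
print('logical cores: ', os.cpu_count())
print('physical cores:', psutil.cpu_count(logical=False))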

I have 4 physical cores, but want to use my machine for other things too, so I reduced your Pool(5) to Pool(3). Then the per-iterate timings were about the same either way.

Back of the envelope

Suppose you have a task that nails 100% of a CPU for T seconds. If you want to run S copies of that task simultaneously, that requires T*S cpu-seconds in total. If you have C entirely free physical cores to throw at it, at most min(C, S) cores can be working on the aggregate simultaneously, so to a first approximation the time needed will be:

T*S / min(C, S)
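As a sketch, that estimate is easy to play with directly (the helper name and the core counts tried below are illustrative; T=14 and S=5 come from the timings above):

def estimated_wall_time(T, S, C):
    # T: CPU-seconds one copy of the task needs
    # S: number of copies run simultaneously
    # C: free physical cores available
    return T * S / min(C, S)

for C in (1, 2, 4, 8):
    print(C, 'cores ->', estimated_wall_time(T=14, S=5, C=C), 'seconds')
# 1 core -> 70.0, 2 cores -> 35.0, 4 cores -> 17.5, 8 cores -> 14.0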

As another reply said, when you have more processes running than cores, the OS cycles through the processes for the duration, acting to make them all take about the same amount of wall-clock time (during some amount of which each process is doing nothing at all except waiting for the OS to let it run again for a while).

I'm guessing you have 2 physical cores. For your example, T is about 14 seconds, and S is 5, so if you had C=2 cores that works out to

14*5 / min(2, 5) = 14*5/2 = 35

seconds. You're actually seeing something closer to 41. Overheads account for part of that, but it seems likely your machine was also doing other work at the same time, so your test run didn't get 100% of the 2 cores.
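One way to see that split on your own machine is to record CPU time alongside wall-clock time (a sketch; benchmking_f_cpu is just a hypothetical variant of the benchmark, and time.process_time() counts only the CPU seconds actually used by the calling process):

import time

def benchmking_f_cpu(n=0):
    # Same work as benchmking_f, but returns (wall seconds, CPU seconds).
    items = range(int(10**(6 + n)))
    start_wall = time.time()
    start_cpu = time.process_time()
    sum_squared = 0
    for i in items:
        sum_squared += i * i
    return time.time() - start_wall, time.process_time() - start_cpu

Run through the same Pool, the CPU-seconds figure should stay much closer to the sequential ~14s, while the wall-clock figure swells toward the ~41s you measured, because the difference is time spent waiting for a core rather than computing.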

Tim Peters
  • Thanks for your answer. Does "per-iterate timings" mean the time benchmking_f takes in each of your 3 threads? –  Sep 04 '19 at 10:14
  • 1
    Yes, on my box, with `Pool(3)` the serial run gives a list of 5 numbers each in 14 to 16 seconds, and the parallel run the same (although a little bigger than in the serial run), With `Pool(4)`, each in 15 to 18 seconds, reflecting that I have 4 physical cores, and other processes need cycles too (the OS, two browsers with a dozen open tabs, etc). Remember that `time.time()` measures wall-clock time! When a CPU-bound process doesn't get 100% of a CPU to run non-stop, some of its wall-clock time is just sitting idle waiting for the OS to let it run again. – Tim Peters Sep 04 '19 at 15:22

Total time is reduced: 70 seconds vs. 42 seconds.

Your computer is processing 5 things at the same time, probably in a round-robin fashion. Threading overhead (context switching, etc.) occurs, so each thread takes longer. However, since the longer threads run in parallel, all 5 finished within 42 seconds.

In the sequential case, your computer is processing the same thing 5 times in a row. Each run can continue until it finishes without interruption (hence, no overhead). Yet, altogether it takes 70 seconds to finish.

Edward Aung
  • By "threading", do you actually mean processes? `multiprocessing` is a package that supports spawning processes using an API similar to the threading module. –  Sep 03 '19 at 04:49
  • Yes, both of you are correct. Threading = process here, and the overhead is more prominent (because process overhead is larger). – Edward Aung Sep 03 '19 at 04:58
  • 1
    @SethMMorton What is the difference between OS level and Program level when talking about threading? –  Sep 04 '19 at 02:58
  • 1
    @singularli You May find this useful: https://stackoverflow.com/q/3042717/1399279 – SethMMorton Sep 04 '19 at 03:03