The problem is that you're using multiprocessing.dummy, i.e. multithreading, not multiprocessing. Multithreading won't make CPU-bound tasks faster; it only helps with I/O-bound tasks.
Try again with multiprocessing.Pool and you should have more success.
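To see the difference, here is a minimal sketch (the cpu_bound helper and the numbers are just illustrative) showing that swapping the thread-backed dummy pool for a real process pool is essentially a one-line change:

import multiprocessing
import multiprocessing.dummy

def cpu_bound(n):
    # pure computation: threads can't run this in parallel because of the GIL
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [100000] * 8
    with multiprocessing.dummy.Pool(4) as tpool:   # 4 threads, effectively one CPU core
        thread_results = tpool.map(cpu_bound, work)
    with multiprocessing.Pool(4) as ppool:         # 4 processes, up to 4 CPU cores
        process_results = ppool.map(cpu_bound, work)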
Also, you need to chunk your input sequence into subsequences so that each process gets enough work to make the overhead worthwhile.
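As a side note, Pool.map itself accepts a chunksize argument, so instead of hand-rolled chunking you can let the pool batch items for you. A rough sketch (using the same square function as below):

from multiprocessing import Pool

def square(x):
    return x*x

if __name__ == '__main__':
    with Pool(4) as pool:
        # chunksize controls how many items are shipped to a worker per task,
        # cutting down the per-item inter-process communication overhead
        results = pool.map(square, range(10000000), chunksize=100000)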
I put this into a solution. Note that the pool has to be created and used only under the if __name__ == '__main__': guard, because multiprocessing starts worker processes that import your module in order to run the mapped function.
import time
from multiprocessing import Pool as ThreadPool  # a real process pool, despite the alias

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split original list in decent sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))
I get the following printout:
--- Parallel map 3.71226 seconds ---
--- Serial map 2.33983 seconds ---
Now the question is: shouldn't the parallel map be faster?
It could be that the whole chunking is costing us efficiency. But it could also be that the interpreter is more "warmed up" when the serial version runs second. So I turned the measurements around:
import time
from multiprocessing import Pool as ThreadPool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split original list in decent sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))
And now I got:
--- Serial map 4.176 seconds ---
--- Parallel map 2.68242 seconds ---
So it's not so clear whether one or the other is faster. But if you want to use multiprocessing, you have to consider whether the overhead of creating the processes is actually much smaller than the speedup you expect. You also run into cache-locality issues, etc.
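A more careful comparison would create the pool once, do an untimed warm-up pass, and average several repetitions of each variant, so that pool start-up and warm-up effects are separated from the actual mapping time. A rough sketch of that idea (reusing the helpers from above; the repetition count is arbitrary):

import time
from multiprocessing import Pool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def timed(label, fn, repeats=3):
    # average over several repetitions to smooth out one-off effects
    start = time.time()
    for _ in range(repeats):
        fn()
    print("--- %s: %g seconds per run ---" % (label, (time.time() - start) / repeats))

if __name__ == '__main__':
    data = range(10000000)
    with Pool(4) as pool:
        pool.map(squareChunk, chunks(data, 100000))  # warm-up pass, not timed
        timed("Serial map", lambda: squareChunk(data))
        timed("Parallel map", lambda: pool.map(squareChunk, chunks(data, 100000)))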