multiprocessing.Pool vs multiprocessing.pool.ThreadPool

Question

Here is a test comparing multiprocessing.Pool, multiprocessing.pool.ThreadPool, and a sequential version. Why is the multiprocessing.pool.ThreadPool version slower than the sequential version?

Is it true that multiprocessing.Pool is faster because it uses processes (i.e. it is not constrained by the GIL), while multiprocessing.pool.ThreadPool uses threads (i.e. it is subject to the GIL), despite living in a package named multiprocessing?

import time


def test_1(job_list):
    from multiprocessing import Pool

    print('-' * 60)
    print("Pool map")
    start = time.time()
    p = Pool(8)
    s = sum(p.map(sum, job_list))
    print('time:', time.time() - start)


def test_2(job_list):
    print('-' * 60)
    print("Sequential map")
    start = time.time()
    s = sum(map(sum, job_list))
    print('time:', time.time() - start)


def test_3(job_list):
    from multiprocessing.pool import ThreadPool

    print('-' * 60)
    print("ThreadPool map")
    start = time.time()
    p = ThreadPool(8)
    s = sum(p.map(sum, job_list))
    print('time:', time.time() - start)


if __name__ == '__main__':
    job_list = [range(10000000)]*128

    test_1(job_list)

    test_2(job_list)

    test_3(job_list)

Output:

------------------------------------------------------------
Pool map
time: 3.4112906455993652
------------------------------------------------------------
Sequential map
time: 23.626681804656982
------------------------------------------------------------
ThreadPool map
time: 76.83279991149902

Python multithreading is less useful than you think it is. The Global Interpreter Lock (GIL) means that only one thread can use the Python interpreter at a time. Since your code is pure Python running in the interpreter, nothing much is gained. Multithreading works best when your threads are waiting for external resources. Google "Global Interpreter Lock Python" for more detailed info. — Frank Yellin, Jan 13 '22 at 17:43
By the way, your test program is a really nice one! Create a lot of work with very little data transferred. — Frank Yellin, Jan 13 '22 at 17:45

ShadowRanger · Accepted Answer · 2022-01-13T18:12:00.197

Your tasks are purely CPU bound (no blocking on I/O), and they don't use any extension code that manually releases the GIL to do large amounts of number-crunching without involving Python-level reference-counted objects (e.g. hashlib hashing large inputs, large-array numpy computations, etc.). As such, the GIL by design prevents you from extracting any parallelism from the code: only one thread can hold the GIL and execute Python bytecode at a time. You go slower because:

1. You have to launch all these threads
2. They have to hand off the GIL between themselves to simulate parallel processing
3. You have to clean up all the threads
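For contrast, here is a minimal sketch of the case where threads do help: tasks that block instead of computing. It assumes time.sleep is a fair stand-in for waiting on I/O (it releases the GIL while waiting); the names blocking_task and timed are made up for illustration.

```python
import time
from multiprocessing.pool import ThreadPool


def blocking_task(delay):
    # time.sleep releases the GIL, so other threads can run while this one waits.
    time.sleep(delay)
    return delay


def timed(fn):
    # Return the wall-clock seconds taken by calling fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


delays = [0.2] * 4

# Sequential: the four waits happen one after another.
sequential_time = timed(lambda: list(map(blocking_task, delays)))

# Threaded: the four waits overlap because each sleeping thread gives up the GIL.
with ThreadPool(4) as pool:
    threaded_time = timed(lambda: pool.map(blocking_task, delays))

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Swapping blocking_task for a pure Python computation such as sum over a range removes the benefit, since the threads then spend their time waiting for the GIL rather than for an external resource.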

In short, yes, ThreadPool does what it says on the tin: It provides the same API as Pool, but backed by threads, not worker processes, and therefore does not avoid GIL limitations and overhead. It wasn't even documented directly until recently; instead, it was indirectly documented by the multiprocessing.dummy docs that were even more explicit about providing the multiprocessing API but backed by threads, not processes (you used it as multiprocessing.dummy.Pool, without the name actually including the word "Thread").
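A quick way to check the relationship described above, as a sketch: in CPython, multiprocessing.dummy.Pool hands back a ThreadPool, and its workers run inside the calling process. The helper name where_am_i is made up for illustration.

```python
import os
import threading
from multiprocessing.dummy import Pool as DummyPool
from multiprocessing.pool import ThreadPool


def where_am_i(_):
    # Report which process and which thread executed this task.
    return os.getpid(), threading.current_thread().name


with DummyPool(2) as pool:
    # The "dummy" Pool is really a ThreadPool under another name.
    is_thread_pool = isinstance(pool, ThreadPool)
    results = pool.map(where_am_i, range(8))

worker_pids = {pid for pid, _ in results}
thread_names = {name for _, name in results}

print("is ThreadPool:", is_thread_pool)
print("all tasks ran in the parent process:", worker_pids == {os.getpid()})
print("worker threads used:", sorted(thread_names))
```

Because every task runs in the parent process, all workers share one interpreter and one GIL.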

I'll note that your test makes Pool look better than it normally would. Usually, Pool will do poorly with tasks like this (lots of data, little computation relative to size of data), because the cost of serializing the data and sending it to the child processes outweighs the minor gains from parallelizing the work. But since your "large data" was represented by range objects (which are serialized cheaply, as a reference to the range class and the arguments to reconstruct it with), very little data is transferred to and from the workers. If you used real data (realized lists of int), the benefits from Pool would go down dramatically. For example, just by changing the definition of job_list to:

job_list = [[*range(10000000)]] * 128

the time for Pool on my machine (which takes 3.11 seconds for your unmodified Pool case) jumps to 8.11 seconds. And even that's a lie, because the pickle serialization code recognizes the same list repeated over and over and serializes the inner list just once, then repeats it with a quick "see that first list" code. I'd tell you what using:

job_list = [[*range(10000000)] for _ in range(128)]

did to the runtime, but I nearly crashed my machine just trying to run that line (it would require ~46 GB of memory to create said list of lists, and that cost would be paid once in the parent process, then again across the children); suffice to say, Pool would lose quite badly especially in cases where the data fits in RAM once, but not twice.
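The serialization costs described above can be measured directly with pickle, as a rough sketch. The sizes are scaled down (100,000 elements, 8 copies) to keep memory use small, and this only looks at a single pickle.dumps call; how Pool batches tasks internally is not modeled here.

```python
import pickle

n = 100_000  # scaled down from the 10,000,000 used in the question

lazy_range = range(n)                        # pickled as the range class plus its arguments
inner = list(range(n))                       # a realized list of int
shared = [inner] * 8                         # eight references to the same list object
distinct = [list(inner) for _ in range(8)]   # eight separate list objects

range_size = len(pickle.dumps(lazy_range))
list_size = len(pickle.dumps(inner))
shared_size = len(pickle.dumps(shared))
distinct_size = len(pickle.dumps(distinct))

print("range:            ", range_size, "bytes")
print("one realized list:", list_size, "bytes")
print("[inner] * 8:      ", shared_size, "bytes")
print("8 distinct lists: ", distinct_size, "bytes")
```

The range stays tiny no matter how large n is, the repeated list costs about as much as one copy because pickle's memo replaces later occurrences with a back-reference, and only the distinct lists pay the full price for every copy.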
