
Multi-threading in CPython cannot use more than one CPU in parallel because of the existence of the GIL. To work around this limitation, we can use multiprocessing. I'm writing Python code to demonstrate this. Here is my code:

from math import sqrt
from time import time
from threading import Thread
from multiprocessing import Process


def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume}")
        return wrapper
    return deco


def calc_sqrt():
    """Consume the CPU"""
    i = 2147483647
    for j in range(20 * 1000 * 1000):
        i -= 1
        sqrt(i)


@time_recorder("one by one")
def one_by_one():
    for _ in range(8):
        calc_sqrt()


@time_recorder("multi-threading")
def multi_thread():
    t_list = list()
    for i in range(8):
        t = Thread(name=f'worker-{i}', target=calc_sqrt)
        t.start()
        t_list.append(t)
    for t in t_list:
        t.join()


@time_recorder("multi-processing")
def multi_process():
    p_list = list()
    for i in range(8):
        p = Process(name=f"worker-{i}", target=calc_sqrt)
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()


def main():
    one_by_one()

    print('-' * 40)
    multi_thread()

    print('-' * 40)
    multi_process()


if __name__ == '__main__':
    main()

Function "calc_sqrt()" is the CPU-consuming job, which calculates square root for 20 million times. Decorator "time_recorder" calculates the running time of the decorated functions. And there are 3 functions which run the CPU-consuming job one by one, in multiple threads and in multiple processes respectively.

By running the above code on my laptop, I got the following output:

Run one by one
Time consumption of one by one: 39.31295585632324
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 39.36112403869629
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 23.380358457565308

Time consumption of "one_by_one()" and "multi_thread()" are almost the same, which are as expected. But time consumption of "multi_process()" is a little bit confusing. My laptop has an Intel Core i5-7300U CPU, which has 2 cores, 4 threads. Task manager simply shows that there are 4 (logic) CPUs in my computer. Task manager also shows that the CPU usage of all the 4 CPUs are 100% during the execution. But the processing time didn't reduce to 1/4 but rather 1/2, why? The operating system of my laptop is Windows 10 64-bit.

Later, I tried this program on a Linux virtual machine and got the following output, which is more reasonable:

Run one by one
Time consumption of one by one: 33.78603768348694
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 34.396817684173584
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 8.470374584197998

This time, the processing time with multi-processing dropped to 1/4 of that with multi-threading. The host of this Linux VM is equipped with an Intel Xeon E5-2670, which has 8 cores and 16 threads. The host OS is CentOS 7. The VM is assigned 4 vCPUs and runs Debian 10.

The questions are:

  • Why didn't the processing time of the multi-processing job drop to 1/4, but only to about 1/2, on my laptop?
  • Is it a CPU issue, meaning that the 4 threads of the Intel Core i5-7300U are not "truly parallel" and may impact each other, while the Intel Xeon E5-2670 doesn't have that issue?
  • Or is it an OS issue, meaning that Windows 10 doesn't support multi-processing well and processes may impact each other when running in parallel?
Vespene Gas
  • It is likely both... the multi-processing implementation on Windows has a much larger overhead for spinning up new processes, and your laptop may have 2 cores and 4 threads, but some of those resources are being used by the operating system itself, as well as any other background tasks, which on a Windows PC tend to be many. Also, laptop processors tend to be designed for low power consumption, which could be another bottleneck. – Alexander Feb 27 '23 at 06:57
  • Results will vary considerably depending on your platform - both hardware and OS. For example, on my machine I get one_by_one = 13.8, multi-threading 13.1, multi-processing 1.9. Having said that, your implementations are naïve and could almost certainly be improved. Also worth mentioning that you should never implement multithreading for CPU-intensive work. You won't gain anything – DarkKnight Feb 27 '23 at 07:25
  • @Pingu 1. Can you please tell me what improvements I can make to my implementation? 2. What OS and CPU are you using? 3. Regarding "never implement multithreading for CPU-intensive work", of course I know it; my code example is demonstrating exactly that. – Vespene Gas Feb 27 '23 at 07:35
  • Using concurrent.futures.ProcessPoolExecutor is a better pattern, especially when/if you want to manage the maximum number of concurrent processes. My OS is macOS 13.2.1 running on a 3 GHz 10-Core Intel Xeon W with 32GB RAM – DarkKnight Feb 27 '23 at 08:09
  • @Pingu 1. OK, got it. I can use a process pool in production code, but for code that demonstrates the difference in CPU consumption between threading and multi-processing, I just don't want to introduce irrelevant concepts to the reader. 2. macOS multi-processing is, to some extent, similar to Linux. And your 10-core CPU is an Intel Xeon rather than an Intel Core. So the processing time of the multi-processing call is 1/7, which is close to 1/10, and that is as expected. What confuses me is why the Windows/Intel Core platform performs so badly. – Vespene Gas Feb 27 '23 at 08:39

2 Answers


As @Pingu said in the comments, the speed gain depends very much on the number of cores of your machine. Your machine only has two physical cores (4 hardware threads, also called logical cores), which are probably partly occupied by your OS's threads. Not only will a machine with more cores likely be more performant at multiprocessing, but the OS bookkeeping will occupy a smaller fraction of the total CPU and have a less significant impact on performance.
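
As a quick sanity check, Python can report how many logical CPUs (hardware threads, not physical cores) the OS exposes; a minimal sketch:

import os
import multiprocessing

# Both calls report *logical* CPUs (hardware threads), e.g. 4 on a
# 2-core/4-thread i5-7300U, not the physical core count.
print(os.cpu_count())
print(multiprocessing.cpu_count())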

Here is an implementation of your test code that lets you change the number of threads/processes used to execute N_TASKS concurrent calls to calc_sqrt:

from math import sqrt
from time import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


N_WORKERS = 8
N_TASKS = 32

def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            out = func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume:.6}s")
            return out
        return wrapper
    return deco


def calc_sqrt(_):
    """Consume the CPU. The argument is ignored; it only exists so
    that the function can be used with executor.map()."""
    i = 2147483647
    for _ in range(5 * 1000 * 1000):
        i -= 1
        sqrt(i)


@time_recorder("one by one")
def one_by_one():
    _ = [calc_sqrt(_) for _ in range(N_TASKS)]


@time_recorder("multi-threading")
def multi_thread():
    with ThreadPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS))


@time_recorder("multi-processing")
def multi_process():
    with ProcessPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS), chunksize=1)


def main():
    one_by_one()

    print('-' * 40)
    multi_thread()

    print('-' * 40)
    multi_process()


if __name__ == '__main__':
    main()

On my machine (M1 Pro MacBook Pro 14"), here are the approximate timings for different numbers of threads/processes:

# threads/processes | Sequential | Multithreading | Multiprocessing
                  1 |        10s |            10s |             10s
                  2 |        10s |            10s |            5.5s
                  4 |        10s |            10s |            2.8s
                  6 |        10s |            10s |            2.2s
                  8 |        10s |            10s |            1.8s
                 10 |        10s |            10s |            1.8s
                 12 |        10s |            10s |            1.8s

As you can see, the multiprocessing performance is roughly proportional to the number of cores. This matches the behavior you observed on your machines: nearly a 2x gain on your 2-core laptop and almost 4x on the 4-vCPU VM.

You can observe saturation at 8 processes (there is no improvement with 10 or 12 concurrent processes), which suggests that my machine has 8 cores available for this kind of work.
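
For reference, the timing table above can be reproduced by sweeping the pool size; a minimal sketch reusing the same calc_sqrt as the code above (the worker counts in the tuple are just the ones tested here):

from math import sqrt
from time import time
from concurrent.futures import ProcessPoolExecutor

N_TASKS = 32

def calc_sqrt(_):
    i = 2147483647
    for _ in range(5 * 1000 * 1000):
        i -= 1
        sqrt(i)

def main():
    # Time N_TASKS tasks for each pool size, as in the table above.
    for n_workers in (1, 2, 4, 6, 8, 10, 12):
        start = time()
        with ProcessPoolExecutor(max_workers=n_workers) as e:
            # Drain the iterator so worker exceptions surface here.
            list(e.map(calc_sqrt, range(N_TASKS), chunksize=1))
        print(f"{n_workers:2d} processes: {time() - start:.2f}s")

if __name__ == '__main__':
    main()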

Note that there is a difference between a CPU's physical cores and its hardware threads (also called hyper-threading). The Core i5-7300U has 4 hardware threads, but this is not equivalent to a 4-(physical-)core machine. Hyper-threading can improve a CPU's multiprocessing performance, but generally by less than adding more physical cores would. For instance, Intel claims a 15% to 30% performance gain from hyper-threading, which is far from the 2x gain you might imagine when reading "2 cores / 4 threads" on the CPU specs.
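
If you want to see both counts programmatically, the third-party psutil package distinguishes them (a sketch; psutil is not in the standard library and must be installed first):

import psutil

# Hardware threads (logical CPUs): 4 on an i5-7300U.
print(psutil.cpu_count(logical=True))
# Physical cores: 2 on an i5-7300U.
print(psutil.cpu_count(logical=False))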

Louis Lac
  • That claim of only a 15 to 30% performance boost was for Pentium 4, which was often limited by its small trace cache (and weak decoders), and two threads running different code made that much worse. And it wasn't as wide as modern CPUs like the OP's Skylake, and didn't have as deep out-of-order exec windows. Given the diminishing returns on all of those things for finding more instruction-level parallelism in typical code, hyperthreading is much more valuable now than it was in Pentium 4. But some workloads don't scale well when a single thread can already use most of a core's resources. – Peter Cordes Feb 28 '23 at 22:41
  • (But I think in the P4 days, even that 30% claim was optimistic. As [How much of ‘What Every Programmer Should Know About Memory’ is still valid?](https://stackoverflow.com/a/47714514) discusses, back in those days, it was so weak that it was usually better not to use it, or use the other logical core to run just a prefetch thread to prefetch from a large array the other thread is looping over. That's not a thing anymore.) – Peter Cordes Feb 28 '23 at 22:44
  • @PeterCordes True, I'm also pessimistic on this Intel claim and many have shown the inefficiency of this optimization. In general it should not be relied on for (most) multiprocessing optimisations. – Louis Lac Mar 01 '23 at 10:07
  • It's not unreasonable for modern CPUs; even for fairly high throughput code like x265 video encoding (1080p `-preset slower`), I got a 20% speedup from 8 threads instead of 4 on my 4c8t i7-6700k. [How to test performance of Intel HyperThreading in Linux X86\_64](https://stackoverflow.com/q/65788455) . For number-crunching / HPC stuff, it can be a negative speedup due to cache competition. (It's also good for leaving a core idle for interactive use when running 1 thread per physical core.) – Peter Cordes Mar 01 '23 at 10:13

The parallelism between two logical cores sharing a physical core is complicated: https://en.wikipedia.org/wiki/Simultaneous_multithreading (Intel brands its implementation of SMT as "hyperthreading"). There's only one set of execution units, and the front-end alternates cycles between the threads when neither is stalled. Out-of-order exec in the back-end does work on uops from both logical cores (confusingly called "hardware threads") at the same time.

If you wrote this in C, you'd get no benefit from hyperthreading for square roots, since the FP div/sqrt unit is slow compared to everything else and very easy for one thread to max out. (This assumes it compiles to a loop doing cvtsi2sd and sqrtsd, a double-precision square root, which has plenty of precision.) Unlike most instructions, division (and square root) isn't fully pipelined on modern CPUs: the execution unit can't start working on a new one every clock cycle. And there's only one such execution unit on your Kaby Lake CPU.

But this is Python; it's taking about 40 seconds for the one_by_one run, which does 8 x 20M = 160M square roots on a single core: roughly 0.25 microseconds per sqrt! At 3.5GHz, that's on the order of 860 clock cycles per square root, vs. an average throughput of one per 4.5 cycles for the sqrtsd asm instruction on Kaby Lake (https://uops.info/, check the SSE or AVX instruction set).
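
You can check the per-call cost from Python itself with timeit (a rough sketch; this measures only the math.sqrt call, without the loop's i -= 1 and bookkeeping, and absolute numbers depend on your machine):

import timeit

n = 10_000_000
t = timeit.timeit('sqrt(2147483647)', setup='from math import sqrt', number=n)
print(f"{t / n * 1e9:.0f} ns per math.sqrt call")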

So interpreter overhead dominates: those hundreds of cycles per iteration go to bytecode dispatch, the i -= 1 on an arbitrary-precision Python integer, and the math.sqrt call (which converts its argument to a double; it is not doing a fancy arbitrary-precision integer square root). It's just a coincidence that hardware FPU sqrt throughput would be the bottleneck for an equivalent C program; in Python, the HW div/sqrt unit is busy for only a trivial fraction of the time.
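
One way to see that the interpreter, not the FPU, dominates: time the same loop with and without the sqrt call (a sketch; results will vary, but the two timings are typically in the same ballpark):

from math import sqrt
from time import time

def loop_with_sqrt(n):
    i = 2147483647
    for _ in range(n):
        i -= 1
        sqrt(i)

def loop_without_sqrt(n):
    i = 2147483647
    for _ in range(n):
        i -= 1

n = 20 * 1000 * 1000
for f in (loop_with_sqrt, loop_without_sqrt):
    start = time()
    f(n)
    print(f"{f.__name__}: {time() - start:.2f}s")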


Hyperthreading is useful when a single thread can't max out the execution resources of a single core, especially because of branch mispredictions, cache misses, or low instructions-per-cycle due to data dependencies (e.g. one long chain of FP adds or multiplies, so there's no instruction-level parallelism for the CPU to find).

Apparently the workload you picked isn't like that, but many workloads do get some speedup. (Often not linear algebra stuff, though; good BLAS libraries can max out the FMA execution units with one thread per core for things like matmul, and having 2 threads per core competing for the same cache tends to make things worse.)



Peter Cordes