I am playing around with concurrent.futures in Python as a means to understand a few simple implementations that use multiprocessing. However, I've come across a very unexpected result. Before I begin, the relevant system details: I'm running Windows on a machine with two physical cores.
Take the following arithmetic series, which gives the mean of the first n non-negative integers:

mean = (0 + 1 + 2 + ... + (n - 1)) / n = (n - 1) / 2
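Just to sanity-check that closed form before scaling it up to 500 million (this snippet is purely illustrative and not part of the benchmark below):

# Closed-form check on a small n: the mean of 0, 1, ..., n-1 is (n - 1) / 2.
n = 10
assert sum(range(n)) / n == (n - 1) / 2  # both sides equal 4.5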
With this idea in mind, I create a function that computes the mean of the integers between a low bound a (inclusive) and a high bound b (exclusive). I then run a test with and without multiprocessing on a range of 500 million integers:
import time
import concurrent.futures


def mean(a, b):
    total_sum = 0
    for next_int in range(a, b):
        total_sum += next_int
    return total_sum / (b - a)


if __name__ == '__main__':
    n = 500000000  # 500 Million

    wall_time = time.time()
    base_ans = mean(0, n)  # From 0 to n-1.
    print("Single Thread Time: " + str(time.time() - wall_time) + " sec.")

    # Split the range into two equal-sized halves, one per worker.
    work = [(0, int(n/2)), (int(n/2), n)]
    num_workers = 2  # One process per core!

    test_ans = 0
    wall_time = time.time()
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
        future_tasks = {executor.submit(mean, job[0], job[1]): job for job in work}
        for future in concurrent.futures.as_completed(future_tasks):
            test_ans += future.result()
    print("Multiprocessing Time: " + str(time.time() - wall_time) + " sec.")

    # Because the halves are equal-sized, the average of the two partial means
    # equals the overall mean.
    print(str(base_ans) + " == " + str(test_ans / num_workers) + " => " + str(base_ans == (test_ans / num_workers)))
Running this code produces the following output:
Single Thread Time: 41.0769419670105 sec. # CPU Utilization ≈ 35% (from task manager)
Multiprocessing Time: 24.71605634689331 sec. # CPU Utilization ≈ 70% (from task manager)
As we can clearly see, a major speedup was observed (roughly 1.66x). However, if I create 4 workers instead of 2, I get an even greater speedup:
work = [(0, int(n/4)), (int(n/4), int(n/2)), (int(n/2), int(3*n/4)), (int(3*n/4), n)]
num_workers = 4
# ...
Single Thread Time: 41.51883292198181 sec. # CPU Utilization ≈ 35% (from task manager)
Multiprocessing Time: 18.18532919883728 sec. # CPU Utilization = 100% (from task manager)
An even greater speedup can be seen here (roughly 2.28x), and it is consistent over many runs!
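For reference, I'm checking my core count like this (a trivial check; note that os.cpu_count() reports logical processors, which may be higher than the physical core count if Hyper-Threading is enabled):

import os

# Number of logical processors visible to the OS; this may exceed the
# number of physical cores (e.g. with Hyper-Threading enabled).
print(os.cpu_count())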
- Since only two processes can run simultaneously on this two (physical) core system, is the efficiency of the Windows scheduler the reason for this continued speedup?
- How can I choose a max_workers value that provides the fastest runtime? How many more processes should I add past the physical core count? (A rough benchmark sketch for measuring this follows below.)
- And lastly, does adding more processes past the physical core count prevent the threads (in multithreading) within each process from running efficiently?
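For what it's worth, below is the kind of rough timing sweep I was planning to use to pick max_workers empirically. It is only a sketch: the candidate worker counts are arbitrary choices of mine, the mean function and n are the same as above, and I've used executor.map instead of submit/as_completed just to keep it short.

import time
import concurrent.futures


def mean(a, b):
    total_sum = 0
    for next_int in range(a, b):
        total_sum += next_int
    return total_sum / (b - a)


if __name__ == '__main__':
    n = 500000000  # 500 Million, same as above.
    for num_workers in (1, 2, 3, 4, 6, 8):  # arbitrary candidate worker counts
        # Split [0, n) into num_workers roughly equal-sized chunks.
        bounds = [int(i * n / num_workers) for i in range(num_workers + 1)]
        work = list(zip(bounds[:-1], bounds[1:]))
        wall_time = time.time()
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
            partial_means = list(executor.map(mean, *zip(*work)))
        print(str(num_workers) + " workers: " + str(time.time() - wall_time) + " sec.")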