
I wrote a simple test to compare single-threaded, multithreaded, and multiprocessing performance in Python 3. The code is given below:

#import libraries
from multiprocessing import Pool
import time
import threading

def calculate_sum_upto(n):
    # Use "total" rather than shadowing the built-in sum()
    total = 0
    for i in range(n):
        total += i
    # print("Sum : " + str(total))

def test_all(limit):
    print("\nFor sum of series upto : " + str(limit))
    # Define input case, that is an array of numbers
    array_of_numbers = [limit for i in range(8)]

    # Adding time for performance calculation
    start_time_1 = time.time()

    # First, let's try using raw approach
    # print("\nStarting Raw approach...\n")
    for num in array_of_numbers:
        calculate_sum_upto(num)
    # print("result obtained using raw approach : " + str(super_sum_raw))
    # print("\nRaw approach finished.")

    end_time_1 = time.time()

    start_time_2 = time.time()

    # Now trying using parallel processing
    # print("\n\nStarting multiprocessing approach...\n")
    pool = Pool()
    super_sum_optimized_values = pool.map(calculate_sum_upto, array_of_numbers)  # list of None values; calculate_sum_upto returns nothing
    pool.close()
    pool.join()
    # print("result obtained using parallel processing approach : " + str(super_sum_optimized))
    # print("\nParallel Processing approach finished.")

    end_time_2 = time.time()

    start_time_3 = time.time()
    # Trying using general threading approach
    # print("\n\nStarting Threading approach...\n")
    thread_array = [threading.Thread(target=calculate_sum_upto, args=(num,)) for num in array_of_numbers]
    for thread in thread_array:
        thread.start()

    for thread in thread_array:
        thread.join()
    # print("\nThreading approach finished.\n\n")
    end_time_3 = time.time()

    # Printing results
    print("\nRaw approach : {:10.5f}".format(end_time_1 - start_time_1))
    print("Multithreading approach : {:10.5f}".format(end_time_3 - start_time_3))
    print("Multiprocessing approach : {:10.5f}".format(end_time_2 - start_time_2))

if __name__ == "__main__":
    # print("This test bench records time for calculating sum of series upto n terms for 8 numbers using 3 approaches : \n1 : Linear calculation for each number one after the other.\n2 : Calculating sum of series for 8 numbers on 8 different threads.\n3 : Calculating sum of series for 8 numbers on 8 different processes.")
    # print("For simplicity, all 8 numbers have the same value, i.e. sum of series upto n terms for m, 8 times.")
    n = 10000
    # for i in range(5):
    #     test_all(n)
    #     n *= 10
    test_all(10000000)

    print("\n\nEnd of test.")

I ran this test two ways:

  1. Directly from PowerShell on Windows 10
  2. From an Ubuntu 18.04 terminal on WSL on the same machine.

I am getting more than 1 second of performance improvement when using Ubuntu. Why is that? Shouldn't the results be the same, since it is the same machine?
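To rule out interpreter or build differences between the two environments, the exact Python version and platform can be recorded alongside the timings. A minimal sketch (the helper name `print_environment` is hypothetical, not part of the test above):

```python
# Hypothetical helper to record the interpreter and platform next to each run,
# so the two environments can be compared apples-to-apples.
import platform
import sys

def print_environment():
    # sys.version includes the exact CPython version and build/compiler info
    print("Python  :", sys.version.replace("\n", " "))
    # platform.platform() distinguishes native Windows from WSL (Linux kernel)
    print("OS      :", platform.platform())
    # Word size of the interpreter build (e.g. "64bit")
    print("Bits    :", platform.architecture()[0])

print_environment()
```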

TESTING ON A QUAD CORE CPU
[AMD Ryzen 3 3200G 3.6 GHz, 4 Core(s), 4 Logical Processor(s)]

Windows :

For sum of series upto : 10000000
Raw approach :    5.08537
Multithreading approach :    5.52041
Multiprocessing approach :    1.40911

Ubuntu Linux using WSL :

For sum of series upto : 10000000
Raw approach :    3.60763
Multithreading approach :    3.70080
Multiprocessing approach :    0.93371
evanhutomo
  • It may be important to know exactly which versions of Python you're testing in Windows and WSL. – Eryk Sun Nov 29 '19 at 14:20
  • As stated above, it's Python 3. The exact version is 3.7 on both. But why should that matter? Especially for single-threaded performance? Maybe some implementation specifics of threading changed, but single-threaded performance should remain constant, since the computation is simple math with no fancy functions? – Shivang Gangadia Nov 30 '19 at 16:17
  • There used to be an optimization for summing integers that unboxed the values as a native CPU data type, if possible. IIRC, the data type used by the optimization was a C `long`, which is always 32-bit in Windows, whether the OS is 32-bit or 64-bit. But in 64-bit Linux it's a 64-bit value. So the optimization would work on all values in the `sum += i` operation in Linux, but in Windows it would have to fall back on using the much slower summation function of the variable-size `int` type. – Eryk Sun Nov 30 '19 at 17:00
  • So maybe if I replace the sum with a multiplication and reduce the size of the input value, I should see similar results? Or will the optimization for sum be used in the backend for multiplication as well? – Shivang Gangadia Nov 30 '19 at 19:40
  • The key word there is that there "used" to be an optimization along those lines. A core developer removed it, but I don't recall whether it was in 3.8 or 3.7. – Eryk Sun Nov 30 '19 at 20:42
  • There can be lots of other reasons for a speed difference, such as the relative efficiency of managing memory. For Windows programs, heap management is a user-mode facility in the NT runtime library. It may not be optimized for the Python interpreter compared to the corresponding memory-management facility in Linux / WSL. Maybe the heap manager in Windows tries harder to minimize memory fragmentation, which would benefit more a long-running program that frequently allocates varying sizes of memory blocks. – Eryk Sun Nov 30 '19 at 20:48
  • Pardon my naive question, but this particular program focuses on computation more than on memory requirements. Also, I opened Task Manager alongside both tests, and in the multiprocessing test all 4 cores showed "almost identical" graphs in both. Should I post those here if it helps identify the reason for the difference? – Shivang Gangadia Dec 02 '19 at 05:51
  • The fundamental difference is in the 'raw' result. Bringing multiprocessing into the question doesn't change this, and only adds to the complexity due to the extremely different implementations of multiprocessing in Unix vs Windows. You should first run your test under a [profiler](https://docs.python.org/3/library/profile.html) to see where time is spent. – Eryk Sun Dec 02 '19 at 11:25
  • As to memory, every `i` value from the `range` iterator and every new `sum` value is an `int` object that's allocated on the heap. These are small objects, and the interpreter has its own small-object allocator that's implemented optimally for Unix and Windows using direct allocation of virtual memory. I'd hope that the performance is comparable. – Eryk Sun Dec 02 '19 at 11:26
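The profiling step suggested in the comments can be sketched with the standard `cProfile` and `pstats` modules, reusing the loop from the question. On both platforms this should show essentially all time spent inside the pure-Python loop itself, which narrows the question to interpreter-level differences:

```python
import cProfile
import pstats

def calculate_sum_upto(n):
    # Same pure-Python loop as in the question (renamed local to avoid
    # shadowing the built-in sum)
    total = 0
    for i in range(n):
        total += i

# Profile a single run of the raw loop
profiler = cProfile.Profile()
profiler.enable()
calculate_sum_upto(10_000_000)
profiler.disable()

# Print the top entries sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

Running this under PowerShell and under WSL and comparing the per-call times for `calculate_sum_upto` would confirm whether the gap lives entirely in the interpreter loop.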

1 Answer


Some of the difference may be due to the fact that Linux has different kinds of threads, and that Linux creates child processes with fork while Windows starts a fresh process. Related: Multiprocessing vs Threading Python, Multithreading VS Multiprocessing in Python, multiprocessing — Process-based parallelism, Threading vs Multiprocessing in Python

Ray Tayek
  • I've run Task Manager during both the Windows and the Linux execution. Both times the usage of all CPU cores was similar, so both did seem to have created 4 processes for computation. Also, even if fork and CreateProcess work differently, that should only affect load time; once the process has been created, execution will continue the same, so this should not affect the execution time. – Shivang Gangadia Dec 07 '19 at 14:24
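For reference, the fork-vs-spawn difference mentioned in the answer can be observed directly. On Linux the default start method for `multiprocessing` is "fork", while Windows supports only "spawn", which re-imports the module in every child and is slower to start. A minimal sketch measuring pool start-up plus a trivial map (the `noop` worker is hypothetical, used only to isolate process-creation cost):

```python
import multiprocessing as mp
import time

def noop(_):
    # Trivial worker: no computation, so the timing below is dominated
    # by process-creation and pool-communication overhead
    pass

if __name__ == "__main__":
    start = time.time()
    with mp.Pool(4) as pool:
        pool.map(noop, range(4))
    elapsed = time.time() - start
    print("start method '{}' : pool startup + map took {:.3f}s".format(
        mp.get_start_method(), elapsed))
```

Forcing `mp.set_start_method("spawn")` on Linux (before creating the pool) should narrow the gap in pool start-up time between the two systems, though, as the comments note, it cannot explain the single-threaded "raw" difference.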