No gain from multiple threads when using ThreadPoolExecutor

Question

I'm trying to simulate some processes in order to get some statistics. I decided to write the simulation program using multiple threads, as each test run is independent.

It means that if I need to perform e.g. 1000 test runs then it should be possible to have 4 threads (each doing 250 test runs).

While doing this I found that adding more threads does not decrease the simulation time.

I have a Windows 10 laptop with 4 physical cores.

I wrote a simple program which shows the behaviour I'm talking about.

from concurrent.futures import ThreadPoolExecutor
import time
import psutil
import random


def runScenario():
    d = dict()
    for i in range(0, 10000):
        rval = random.random()
        if rval in d:
            d[rval] += 1
        else:
            d[rval] = 1
    return len(d)    

def runScenarioMultipleTimesSingleThread(taskId, numOfCycles):
    print('starting thread {}, numOfCycles is {}'.format(taskId, numOfCycles))

    # named total rather than sum to avoid shadowing the built-in sum()
    total = 0
    for i in range(numOfCycles):
        total += runScenario()

    print('thread {} finished'.format(taskId))

    return total

def modelAvg(numOfCycles, numThreads):

    cyclesPerThread = numOfCycles // numThreads
    numOfCycles = cyclesPerThread * numThreads

    # the with-block shuts the pool down once all futures have completed
    with ThreadPoolExecutor(max_workers=numThreads) as pool:
        futures = list()
        for i in range(numThreads):
            future = pool.submit(runScenarioMultipleTimesSingleThread, i, cyclesPerThread)
            futures.append(future)

        total = 0
        for future in futures:
            total += future.result()

    return total / numOfCycles


def main():
    p = psutil.Process()
    print('cpus:{}, affinity{}'.format(psutil.cpu_count(), p.cpu_affinity() ))

    start = time.time()
    modelAvg( numOfCycles = 10000, numThreads = 4)
    end = time.time()

    print('simulation took {}'.format(end - start))

if __name__ == '__main__':
    main()

These are the results:

One thread:

cpus:8, affinity[0, 1, 2, 3, 4, 5, 6, 7]
starting thread 0, numOfCycles is 10000
thread 0 finished
simulation took 23.542529582977295

Four threads:

cpus:8, affinity[0, 1, 2, 3, 4, 5, 6, 7]
starting thread 0, numOfCycles is 2500
starting thread 1, numOfCycles is 2500
starting thread 2, numOfCycles is 2500
starting thread 3, numOfCycles is 2500
thread 1 finished
thread 2 finished
thread 0 finished
thread 3 finished
simulation took 23.508538484573364

I expected that with 4 threads the simulation time would ideally be 4 times smaller, and of course it should not stay the same.

Related: [When are Python threads fast?](https://stackoverflow.com/questions/8994438/when-are-python-threads-fast), [How to get a faster speed when using multi-threading in python](https://stackoverflow.com/questions/10154487/how-to-get-a-faster-speed-when-using-multi-threading-in-python) — wwii, Sep 03 '19 at 21:31

score 3 · Accepted Answer · answered Sep 03 '19 at 21:27

When you are using CPython, you won't get a significant speedup by distributing a computational load across threads. This is because the Global Interpreter Lock (GIL) allows only one thread at a time to execute Python bytecode, so CPU-bound threads take turns instead of running in parallel. I have experienced this when processing text, for example.

In this case, if you monitor your CPU, you would likely see that your process never keeps 4 cores fully busy. Total usage stays at about one core's worth, for example roughly 25% on each of 4 cores.

You can use the multiprocessing module (or concurrent.futures.ProcessPoolExecutor) to really spread your load across CPUs.
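As a minimal sketch of that suggestion, the question's workload could be moved to worker processes roughly like this. The names run_batch and model_avg and the executor_cls parameter are made up for illustration and are not from the question; the cycle count is reduced so the example finishes quickly.

```python
from concurrent.futures import ProcessPoolExecutor
import random
import time


def run_scenario():
    # Same workload as in the question: count distinct random values.
    seen = {}
    for _ in range(10000):
        rval = random.random()
        seen[rval] = seen.get(rval, 0) + 1
    return len(seen)


def run_batch(num_cycles):
    # Runs inside one worker and returns the summed result for its share.
    return sum(run_scenario() for _ in range(num_cycles))


def model_avg(num_cycles, num_workers, executor_cls=ProcessPoolExecutor):
    # executor_cls is a hypothetical hook that makes it easy to swap in
    # ThreadPoolExecutor and compare timings.
    cycles_per_worker = num_cycles // num_workers
    total_cycles = cycles_per_worker * num_workers
    with executor_cls(max_workers=num_workers) as pool:
        totals = pool.map(run_batch, [cycles_per_worker] * num_workers)
        return sum(totals) / total_cycles


if __name__ == "__main__":
    # The guard is required on Windows, where worker processes re-import
    # this module when they start.
    start = time.perf_counter()
    avg = model_avg(num_cycles=400, num_workers=4)
    print("average: {}, took {:.2f}s".format(avg, time.perf_counter() - start))
```

Each process has its own interpreter and its own GIL, so the batches can run on separate cores. Arguments and return values are pickled between processes, which is why the workers here take and return only small integers.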

Threads can still provide performance improvements in Python when your threads are IO-bound (as opposed to CPU-bound), because the GIL is released while a thread waits on I/O.
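To illustrate the IO-bound case, here is a small sketch in which time.sleep stands in for a blocking I/O call such as a network request. The names fake_io and timed_run are hypothetical, and the delay is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_io(task_id, delay=0.2):
    # time.sleep simulates waiting on I/O; the GIL is released while sleeping.
    time.sleep(delay)
    return task_id


def timed_run(num_tasks, num_threads):
    # Run num_tasks simulated I/O calls on a pool and measure the wall time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fake_io, range(num_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for threads in (1, 4):
        _, elapsed = timed_run(num_tasks=4, num_threads=threads)
        print("{} thread(s): {:.2f}s".format(threads, elapsed))
```

With four threads the waits overlap, so the total time should be close to a single delay rather than the sum of all four.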

Thank you very much for the explanation. ProcessPoolExecutor really does what I expect. — NwMan, Sep 04 '19 at 08:48
