The short answer: Not easily.
Here is an example of using a multiprocessing pool to speed up your code:
import random
import time
from multiprocessing import Pool

startTime = time.time()

def f(_):
    number = str(random.randint(1, 9999))
    data.write(number + '\n')  # every worker writes to the shared file handle

data = open("file2.txt", "a+")  # workers inherit this handle on fork-based platforms
with Pool() as p:
    p.map(f, range(1000000))
data.close()

executionTime = (time.time() - startTime)
print(f'Execution time in seconds: {executionTime}')
Looks good? Wait! This is not a drop-in replacement: it lacks synchronization between the processes, so not all 1000000 lines will be written (some overwrite each other in the same buffer)!
See Python multiprocessing safely writing to a file
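If you do want each worker to write directly, the idea discussed there is to share a lock between the processes. A minimal sketch of that approach (the initializer pattern and the global lock variable are my additions, not part of the code above):

import random
from multiprocessing import Pool, Lock

def init(l):
    # store the shared lock in a global so each worker can reach it
    global lock
    lock = l

def f(_):
    number = str(random.randint(1, 9999))
    with lock:  # only one process may write at a time
        with open("file2.txt", "a") as data:
            data.write(number + '\n')

with Pool(initializer=init, initargs=(Lock(),)) as p:
    p.map(f, range(1000000))

Note that the locking serializes the writes again, so this mainly fixes correctness, not speed.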
So we need to separate computing the numbers (in parallel) from writing them (in serial). We can do this as follows:
import random
import time
from multiprocessing import Pool

startTime = time.time()

def f(_):
    return str(random.randint(1, 9999))

with Pool() as p:
    output = p.map(f, range(1000000))  # compute all numbers in parallel
with open("file2.txt", "a+") as data:
    data.write('\n'.join(output) + '\n')  # write them out in one serial step

executionTime = (time.time() - startTime)
print(f'Execution time in seconds: {executionTime}')
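If holding a million strings in memory at once is a concern, Pool.imap can stream results back as they complete, so the parent writes them out incrementally. A sketch of that variant (the chunksize value is an arbitrary choice of mine):

import random
from multiprocessing import Pool

def f(_):
    return str(random.randint(1, 9999))

with Pool() as p, open("file2.txt", "a+") as data:
    # imap yields results in order as they arrive instead of
    # building the complete list first
    for number in p.imap(f, range(1000000), chunksize=10000):
        data.write(number + '\n')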
With that fixed, note that this uses multiple processes, not multiple threads. You can switch it to multithreading with a different pool class:
from multiprocessing.pool import ThreadPool as Pool
On my system, this takes the runtime from 1 second down to 0.35 seconds with the process pool. With the ThreadPool, however, it takes up to 2 seconds!
The reason is that Python's global interpreter lock prevents multiple threads from processing your task efficiently; see What is the global interpreter lock (GIL) in CPython?
To conclude, multithreading is not always the right answer:
- In your scenario, one limit is file access: only one thread can write to the file at a time, or you would need to introduce locking, which makes any performance gain moot.
- Also, in Python, multithreading only suits specific tasks, e.g. long computations that happen in a library below Python and can therefore run in parallel. In your scenario, the overhead of multithreading negates its small potential for a performance benefit. (See the sketch after this list for a task where threads do help.)
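To illustrate that last point, threads do pay off when each task spends its time in a blocking call that releases the GIL, such as I/O. A contrived sketch (time.sleep stands in for real I/O here; it is my assumption, not your workload):

import time
from multiprocessing.pool import ThreadPool

def io_task(_):
    time.sleep(0.1)  # simulates a blocking I/O call that releases the GIL

with ThreadPool(10) as p:
    p.map(io_task, range(20))  # finishes in about 0.2 s instead of 2 s serially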
The upside: Yes, with multiprocessing instead of multithreading I got a 3x speedup on my system.