
I have a small Python script that generates a lot of data and writes it to a file. It takes about 6 minutes to generate 6 GB of data; however, my target data size could be up to 1 TB. Extrapolating linearly, that would take about 1000 minutes to generate 1 TB, which is unacceptable to me.

So I am wondering: will multithreading help me shorten that time, and why or why not? If not, do I have other options?

Thanks!

Henry
  • Your bottleneck here is likely the speed at which you can write to your hard disk, which means multiple threads/processes aren't going to help. – dano Sep 12 '14 at 20:14
  • Try writing to /dev/null. The time difference will give you a hint of where the bottleneck is. – tdelaney Sep 12 '14 at 20:21

3 Answers


Currently, typical hard drives can write on the order of 100 MB per second.

Your program is writing 6 GB in 6 minutes, which means the overall throughput is ~ 17 MB/s.

So your program is not pushing data to disk anywhere near the maximum rate (assuming you have a typical hard drive).

So your problem might actually be CPU-bound.

If this "back-of-the-envelope" calculation is correct, and if you have a machine with multiple processors, using multiple processes could help you generate more data quicker, which could then be sent to a single process which writes the data to disk.


Note that if you are using CPython, the most common implementation of Python, the GIL (global interpreter lock) prevents more than one thread from executing Python bytecode at a time. So to do concurrent calculations, you need to use multiple processes rather than multiple threads; the multiprocessing or concurrent.futures modules can help you here.
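For example, here is a rough sketch of that arrangement using multiprocessing. generate_chunk is a hypothetical stand-in for your real computation, and the chunk size and count are assumptions; workers generate chunks in parallel while the parent process performs every write:

import multiprocessing as mp
import os

CHUNK_SIZE = 10**7   # 10 MB per generated chunk (an assumed granularity)
N_CHUNKS = 600       # 600 x 10 MB = 6 GB total

def generate_chunk(i):
    return b'A' * CHUNK_SIZE   # placeholder for your real CPU-bound generation

def main():
    with mp.Pool(os.cpu_count()) as pool, open('/tmp/test', 'wb') as f:
        # imap streams chunks back in order, so this single parent
        # process performs every disk write.
        for chunk in pool.imap(generate_chunk, range(N_CHUNKS)):
            f.write(chunk)

if __name__ == '__main__':
    main()

Shipping chunks between processes has serialization overhead of its own, so measure before committing to this design.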


Note that even if your hard drive can write 100 MB/s, it would still take roughly 170 minutes (about 10,000 seconds) to write 1 TB to disk, and once your processes generate data faster than 100 MB/s, adding more of them brings no further speed gain: at that point the disk becomes the bottleneck.

Of course, your hardware may be much faster or much slower than this, so it pays to know your hardware specs.

You can estimate how fast you can write to disk using Python by doing a simple experiment:

with open('/tmp/test', 'wb') as f:
    x = b'A' * 10**8   # 100 MB of bytes, since the file is opened in binary mode
    f.write(x)

% time python script.py

real    0m0.511s
user    0m0.020s
sys     0m0.020s

% ls -l /tmp/test
-rw-rw-r-- 1 unutbu unutbu 100000000 2014-09-12 17:13 /tmp/test

This shows 100 MB were written in 0.511s. So the effective throughput was ~195 MB/s.

Note that if you instead call f.write in a loop:

with open('/tmp/test', 'wb') as f:
    for i in range(10**7):
        f.write(b'A')   # one byte per call: ten million Python-level calls

then the effective throughput drops dramatically to just ~3 MB/s. So how you structure your program -- even if using just a single process -- can make a big difference. This is an example of how collecting your data into fewer but bigger writes can improve performance.
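As a hedged sketch of that batching idea (the ~1 MB block size is an arbitrary choice), you can collect the small pieces in a bytearray and hand them to f.write in large blocks:

with open('/tmp/test', 'wb') as f:
    buf = bytearray()
    for i in range(10**7):
        buf += b'A'              # collect small pieces in memory
        if len(buf) >= 10**6:    # hand them to the file in ~1 MB blocks
            f.write(buf)
            buf.clear()
    f.write(buf)                 # write any remainder

Python's binary file objects already buffer writes internally, so most of the ~3 MB/s figure above is the overhead of ten million Python-level f.write calls, which batching avoids.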


As Max Noel and kipodi have already pointed out, you can also try writing to /dev/null:

import os

with open(os.devnull, 'wb') as f:

and timing a shortened version of your current script. This will show you how much time is being consumed (mainly) by CPU computation. It's this portion of the overall run time that may be improved by using concurrent processes. If it is large then there is hope that multiprocessing may improve performance.
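For instance, a minimal timing harness along those lines might look like this, where b'A' * 10**8 stands in for one 100 MB round of your real generation code:

import os
import time

t0 = time.perf_counter()
with open(os.devnull, 'wb') as f:
    for i in range(60):
        f.write(b'A' * 10**8)   # placeholder for one 100 MB round of generation
print('generated ~6 GB in %.1f s' % (time.perf_counter() - t0))

If this runs in, say, a minute while the real run takes six, most of the time is going to disk; if it still takes nearly six minutes, you are CPU-bound.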

unutbu

In all likelihood, multithreading won't help you.

Your data generation speed is either:

  • IO-bound (that is, limited by the speed of your hard drive), and the only way to speed it up is to get a faster storage device. The only type of parallelization that can help you is finding a way to spread your writes across multiple devices (can you use multiple hard drives?).

  • CPU-bound, in which case Python's GIL means you can't take advantage of multiple CPU cores within one process. The way to speed your program up is to make it so you can run multiple instances of it (multiple processes), each generating part of your data set.

Regardless, the first thing you need to do is profile your program. What parts are slow? Why are they slow? Is your process IO-bound or CPU-bound? Why?
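For example, the standard-library profiler can be run over the whole script unchanged; if most of the cumulative time lands in file write calls you are IO-bound, and if it lands in your generation functions you are CPU-bound:

% python -m cProfile -s cumulative script.py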

Max Noel

6 minutes to generate 6 GB means you generate 1 GB per minute. A typical hard drive can sustain roughly 80-100 MB/s of write throughput when new, which puts the IO limit at about 5-6 GB per minute.
So it looks like the limiting factor is the CPU, which is good news (running more instances can help you).
However, I wouldn't use multithreading in Python because of the GIL. A better idea is to run several processes that write to different offsets of the file, or to use Python's multiprocessing module; a rough sketch follows.
I would check first, though, by running the script writing to /dev/null, to make sure you truly are CPU-bound.
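Here is a rough sketch of that offset scheme, assuming a fixed region size per worker (REGION, fill_region, and the b'A' filler are illustrative placeholders, not the asker's actual code):

import multiprocessing as mp

PATH = '/tmp/test'
REGION = 10**8   # each worker owns a 100 MB region of the file (illustrative)
N_WORKERS = 4

def fill_region(i):
    with open(PATH, 'r+b') as f:   # update in place, don't truncate
        f.seek(i * REGION)
        f.write(b'A' * REGION)     # placeholder for real generated data

if __name__ == '__main__':
    with open(PATH, 'wb') as f:
        f.truncate(REGION * N_WORKERS)   # preallocate so every region exists
    with mp.Pool(N_WORKERS) as pool:
        pool.map(fill_region, range(N_WORKERS))

Preallocating the file first means the workers never race to extend it; whether parallel writers actually beat a single writer depends on the disk, so verify with the /dev/null test first.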

kipodi