
Say you want to process many files as quickly as possible, where processing time > file read time.

  • Will reading multiple files using a thread pool increase throughput? Or does it just cause more disk contention?
  • If a thread pool does help, what determines how many threads are needed to achieve the maximum? Can this be calculated for a given target system?
  • For a single core, will a loop reading and processing asynchronously via threads be faster than doing it synchronously? I assume that since disk latency is so high, it would be. But maybe if the file read is much, much smaller than the processing time, it is better to let the processing step finish uninterrupted without context switches.

Also, do you have any other tips for maximizing disk throughput?

Azmisov
  • *any other tips for maximizing disk throughput?* Buy faster disks and be done with the problem for all time, without worrying about bugs in your processing algorithms, spending time and money writing code, and having to maintain all that code in the future. – Andrew Henle Mar 07 '19 at 10:22
  • It depends on a lot of factors: OS, available CPUs, available memory, disk performance and so on. When the files are small (< 1 MB) and there are only a few hundred of them, then maybe reading them all into memory and then processing them can be faster than reading, processing, reading and so on. But I believe you have to test and profile things on your own. – user743414 Mar 07 '19 at 13:54
  • @AndrewHenle That's always good to keep in mind. Though, if the software is intended to run on a variety of different hardware/OS configurations, like a framework, you still would want to employ some software-based techniques as well. – Azmisov Mar 15 '19 at 00:44
  • @user743414 Despite the numerous possible configurations, I suspect the internals of reading data from disk are implemented pretty similarly across the board for various motherboards, CPUs, RAM, etc. I was hoping someone with more expertise on the internals could describe the general principles, without having to benchmark across many rigs. – Azmisov Mar 15 '19 at 00:48

1 Answer


I did some benchmarking to come up with some general guidelines. I tested with ~500k smallish (~14 KB) files. I think the results should be similar for medium-sized files; but for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.

I tested on a machine with 16 virtual cores (8 physical), dual-channel RAM, and Linux kernel 4.18.

Do multiple threads increase read throughput?

The answer is yes. I think this could be due either to 1) a hardware bandwidth limitation for single-threaded applications, or 2) the OS's disk request queue being better utilized when many threads are making requests. The best performance is with virtual_cores*2 threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, then it is better to have a thread pool of size virtual_cores. If, however, < 50% of pages are cached (which I think is the more common case), then virtual_cores*2 will do just fine.

I think the reason why virtual_cores*2 is better than just virtual_cores is that a file read also includes some non-disk related latency like system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one is waiting on the disk, a second can be executing the non-disk related file read operations. (Could it also be due to the fact that the RAM is dual channel?)
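
For illustration, a minimal sketch of how such a read pool could be sized (Python; read_file, read_all, and file_paths are placeholder names, not code from my benchmark):

import os
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    with open(path, "rb") as f:
        return f.read()

def read_all(file_paths):
    # size the pool at virtual_cores*2; os.cpu_count() reports virtual (hyperthreaded) cores
    read_threads = (os.cpu_count() or 1) * 2
    with ThreadPoolExecutor(max_workers=read_threads) as pool:
        # each worker blocks on disk IO and releases the GIL, so reads overlap
        return list(pool.map(read_file, file_paths))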

I tested reading files in random order vs. sequentially (by looking up each file's physical block location on disk and ordering the requests by it). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time, as opposed to the processing of said files, I suggest reordering the requests for sequential access to get a boost.
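
As a rough, Linux-specific sketch of that ordering (the FIBMAP ioctl generally requires root, and the inode-number fallback is only an approximation of on-disk placement):

import fcntl
import os
import struct

FIBMAP = 1  # from <linux/fs.h>

def first_block(path):
    # physical block of the file's first logical block (needs CAP_SYS_RAWIO/root)
    try:
        with open(path, "rb") as f:
            buf = struct.pack("I", 0)
            return struct.unpack("I", fcntl.ioctl(f.fileno(), FIBMAP, buf))[0]
    except OSError:
        # unprivileged fallback: inode number roughly tracks on-disk placement
        return os.stat(path).st_ino

def sequential_order(file_paths):
    return sorted(file_paths, key=first_block)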

[Figure: read throughput vs thread count]

There is the possibility of using asynchronous disk IO instead of a thread pool. However, from my reading it appears there is not yet a portable way to do it (see this reddit thread). Also, libuv, which powers NodeJS, uses a thread pool to handle its file IO.

Balancing read vs processing throughput

Ideally, we could have reading and processing in separate threads. While we are processing the first file, we can be queuing up the next one in another thread. But the more threads we allocate for reading files, the more CPU contention there is with the processing threads. The solution is to give the faster operation (reading vs processing) the fewest threads that still leave zero processing delay between files. This formula seemed to give good results in my tests:

# read_time, process_time: measured average time per file; virtual_cores: e.g. os.cpu_count()
from math import ceil

prop = read_time / process_time
if prop > 1:
    # reading is the slower step: double virtual core count gives fastest reads, as per tests above
    read_threads = virtual_cores * 2
    process_threads = ceil(read_threads / (2 * prop))
else:
    process_threads = virtual_cores
    # double the read thread pool so the CPU can interleave better, as mentioned above
    read_threads = 2 * ceil(process_threads * prop)

For example: read = 2s, process = 10s (prop = 0.2), so aim for 2 reading threads for every 5 processing threads.
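
Putting it together, a hedged sketch of the full pipeline (read pool feeding a process pool through a bounded queue); read_file, process, file_paths, and prop are stand-ins for your own IO routine, processing routine, file list, and measured ratio:

import os
import queue
import threading
from math import ceil

def run_pipeline(file_paths, read_file, process, prop):
    cores = os.cpu_count() or 1
    if prop > 1:
        read_threads = cores * 2
        process_threads = ceil(read_threads / (2 * prop))
    else:
        process_threads = cores
        read_threads = 2 * ceil(process_threads * prop)

    # bounded queue: readers sleep when they get too far ahead of the processors
    q = queue.Queue(maxsize=read_threads * 2)
    paths = iter(file_paths)
    lock = threading.Lock()

    def reader():
        while True:
            with lock:                      # iterators aren't thread-safe
                path = next(paths, None)
            if path is None:
                break
            q.put(read_file(path))

    def processor():
        while True:
            data = q.get()
            if data is None:                # sentinel: no more files
                break
            process(data)

    readers = [threading.Thread(target=reader) for _ in range(read_threads)]
    processors = [threading.Thread(target=processor) for _ in range(process_threads)]
    for t in readers + processors:
        t.start()
    for t in readers:
        t.join()
    for _ in processors:
        q.put(None)                         # one sentinel per processing thread
    for t in processors:
        t.join()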

In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads: for a prop close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyway if the files aren't being consumed fast enough. (The same should be true of process threads when prop is very large.)

On the other hand, too few reading threads have a much more significant impact (which addresses my third original question). For example, for a very large prop, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads occupy all of the benchmark machine's cores, the read thread suffers too much CPU contention and 36% of the time fails to have the next file queued up for processing. So, my recommendation is to err in favor of too many read threads. Doubling the read thread pool size as in my formula above should accomplish this.

Side note: you can limit the CPU resources your application consumes by setting virtual_cores to a smaller percentage of the available cores. You may also choose to forgo the doubling, since CPU contention may be less of an issue when one or more spare cores are not executing the more intensive processing threads.

Summary

Based on my test results, using a thread pool with virtual_cores*2 file-reading threads and virtual_cores file-processing threads will give you good performance for a variety of different timing scenarios. This configuration should get you within ~2% of the maximal throughput, without having to spend lots of time benchmarking.

Azmisov
  • Another thing to mention is that these days, SSDs will often advertise their random read performance in terms of queue depth, e.g. QD16 is the read performance when 16 threads are requesting 4 KB of data. You can use that to gauge what kind of queue depth (async disk read threads) would be optimal. – Azmisov Jun 22 '21 at 01:37
  • Here is another relevant question on asynchronous file IO, rather than using a thread pool: https://stackoverflow.com/questions/13407542/is-there-really-no-asynchronous-block-i-o-on-linux – Azmisov Aug 10 '21 at 22:42