The main bottleneck you need to be aware of here is neither disk nor CPU. It is memory: not how much you have, but how the hardware and OS work together to pre-fetch pages from RAM into the L1/L2/L3 caches.
The parallel approach will perform worse than the serial approach if you use readline() to load one line at a time. The reason has to do with hardware and the OS, not your software. When the CPU performs a memory read, a certain amount of adjacent data is fetched into the CPU's Lx caches in anticipation that it will be needed soon. With the serial approach, that extra data is in fact used while it is still in the cache. But when parallel readline() calls are happening across processes, the extra data is evicted before there is a chance to use it, so it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the serial readline() approach is to have your parallel processes read more than one line at a time into memory: use read() instead of readline(). How many bytes should you read? Roughly the size of your L1 cache, on the order of 64 KB. That way several pages of contiguous memory are loaded into the cache per call.
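In code, the change described above is simply this (a sketch; the file name is illustrative, and handling of lines that straddle chunk boundaries is omitted):

```python
CHUNK = 64 * 1024                    # roughly L1-sized; tune for your hardware

with open('input.dat', 'rb') as f:   # hypothetical input file
    while True:
        data = f.read(CHUNK)         # one cache-sized block instead of f.readline()
        if not data:
            break
        # ... split `data` into lines and process them here ...
```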
However, if you replace serial readline() with serial read(), the serial version will again outperform parallel read(). Why? Because although each core has its own L1 cache, the cores share the higher cache levels and the memory bus, so you run into the same problem: Process 1 pre-fetches data into the shared caches, but before it has time to process it all, Process 2 is scheduled and replaces the contents of the cache. Process 1 therefore has to fetch the same data again later.
The performance difference between serial and parallel can be easily seen:
First make sure the whole file is in the page cache (check the buff/cache column of free -m). This removes the disk/block-device layer from the equation entirely: the whole file is served from RAM.
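A quick way to warm the page cache from Python (the file name is illustrative) is to read the file once and discard the data; free -m should then show the file reflected under buff/cache:

```python
# Pull the whole file into the OS page cache so later timings never touch disk.
with open('input.dat', 'rb') as f:   # hypothetical input file
    while f.read(1 << 20):           # read and discard 1 MB at a time
        pass
```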
Then run a simple serial reader and time it.
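A minimal serial harness might look like this (a sketch; the byte count stands in for real per-chunk work):

```python
import time

def serial_read(path, chunk=64 * 1024):
    total = 0
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)       # stand-in for real work on `data`
    return total

t0 = time.perf_counter()
serial_read('input.dat')             # hypothetical input file
print('serial: %.3fs' % (time.perf_counter() - t0))
```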
Then run a simple parallel reader using multiprocessing.Process() and time it. On commodity systems, the serial reads will outperform the parallel reads.
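And a parallel counterpart, splitting the file into one byte range per core (again a sketch with an illustrative file name; records straddling slice boundaries are ignored):

```python
import multiprocessing as mp
import os
import time

def slice_read(path, start, length, chunk=64 * 1024):
    """Read one contiguous byte range of the file in cache-sized chunks."""
    with open(path, 'rb') as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:
                break
            remaining -= len(data)
            # ... process `data` here ...

if __name__ == '__main__':
    path = 'input.dat'               # hypothetical input file
    size = os.path.getsize(path)
    n = mp.cpu_count()
    step = size // n
    t0 = time.perf_counter()
    procs = [mp.Process(target=slice_read,
                        args=(path, i * step,
                              size - i * step if i == n - 1 else step))
             for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('parallel: %.3fs' % (time.perf_counter() - t0))
```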
This is contrary to what you'd expect, right? But think about the assumptions you were making: they didn't include the fact that caches and memory buses are shared between cores, and that is precisely where the bottleneck is located. We know disk isn't the bottleneck because the entire file is in the page cache in RAM. We know CPU isn't the bottleneck because multiprocessing.Process() gives us simultaneous execution, one process per core (verify with vmstat 1, or top -d 1 and press 1 for the per-core view).
The only way to get better performance from the parallel approach is to run on hardware with separate NUMA nodes, where each node (group of cores) has its own memory bus and its own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, putting 100% utilization on every core at the same time. This is done using multiprocessing.Process().
Thread pools will not work for this: Python threads are serialized by the GIL, so only one thread executes Python bytecode at a time, confining the pool to a single core's worth of computation. But multiprocessing.Process() is not restricted by the GIL and certainly does work.
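A quick way to see this for yourself (a pure-Python busy loop; the iteration count is arbitrary) is to start one Process per core and watch every core go to 100% in top:

```python
import multiprocessing as mp

def burn(n):
    """Pure-Python arithmetic loop; pins one core at ~100% while it runs."""
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    procs = [mp.Process(target=burn, args=(50_000_000,))
             for _ in range(mp.cpu_count())]
    for p in procs:
        p.start()                # watch per-core usage: run top, then press 1
    for p in procs:
        p.join()
```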
Another side note: it is bad practice to try to load your entire input file onto the heap. You never know whether the file is larger than available RAM, and if it is, you'll trigger an out-of-memory kill. So don't do it in an attempt to optimize the performance of your code.
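If you do need the whole file addressable at once, one alternative (not part of the discussion above, just a common pattern) is mmap, which maps the file through the page cache so the OS pages data in and out on demand instead of copying it all onto the heap:

```python
import mmap

with open('input.dat', 'rb') as f:   # hypothetical input file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:4096]           # slices are paged in lazily via the page cache
```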