Parallel processing of a large .csv file in Python

Question

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.

The files have different row lengths, and cannot be loaded fully into memory for analysis.

Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.

The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:

csvReader = csv.reader(open("file","r"))
for row in csvReader:
   handleRow(row, dataStructure)

Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?

In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.

Thanks!

Perhaps Python (because it is interpreted) might not be the right tool to deal with very large data sets? Have you considered recoding your calculation in a faster, compiled, language (Ocaml, C++, ...)? — Basile Starynkevitch, Dec 08 '11 at 00:48
I have considered it. It's a question of dev time vs. flexibility. At this time I insist on Python because it's so much faster to develop complex analysis code in it. — Ron, Dec 08 '11 at 01:01
You could also use the [fh.readlines(size)](http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects) method to read several MBs at a time. Then pass these blocks of lines into a thread/process. — max, Feb 11 '13 at 21:12
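
A minimal sketch of the `readlines(size)` idea from the comment above, assuming Python 3 and a CSV without quoted fields that span several lines. The names `iter_line_blocks` and `parse_block` are made up for illustration; each yielded block is a list of complete lines that could be handed to a worker.

```python
import csv


def iter_line_blocks(path, size_hint=4 * 1024 * 1024):
    """Yield lists of complete lines, each list holding roughly size_hint characters."""
    with open(path, "r", newline="") as fh:
        while True:
            # readlines(hint) stops after the line that reaches the hint,
            # so a block never ends in the middle of a line.
            block = fh.readlines(size_hint)
            if not block:
                break
            yield block


def parse_block(lines):
    """Parse one block of raw lines into CSV rows (this is what a worker would do)."""
    return list(csv.reader(lines))
```

Each block is a plain list of strings, so it can be pickled and sent to another process; the parsing cost then moves to the worker as well.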

score 25 · Answer 1 · edited Nov 14 '17 at 14:36

This might be too late, but I'll post anyway for future users. Another poster mentioned using multiprocessing. I can vouch for it and go into more detail. We deal with files in the hundreds of MB to several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what the file type.

You can process pieces of the large files concurrently. Here's pseudo code of how we do it:

import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file...
    else:
        with open(filename, 'r') as fh:
            fh.seek(start)
            lines = fh.readlines(stop - start)
            ... process these lines ...

    return results

if __name__ == "__main__":

    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024

    # determine if it needs to be split
    if filesize > split_size:

        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'r') as fh:

            # for every chunk in the file, including a final partial one...
            while cursor < filesize:

                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size

                # seek to end of chunk and read next line to ensure you 
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()

                # get current file location
                end = fh.tell()

                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)

                # setup next chunk
                cursor = end

        # close and wait for pool to finish
        pool.close()
        pool.join()

        # iterate through results
        for proc in results:
            processfile_result = proc.get()

    else:
        ...process normally...

Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
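
For anyone who wants something runnable, here is a minimal Python 3 sketch of the same idea. It is not the original production code. It opens the file in binary mode so that `seek`/`tell` offsets are plain byte positions, and it assumes no CSV field contains an embedded newline. `chunk_offsets`, `count_lines` and `process_in_parallel` are hypothetical names, and `count_lines` is only a stand-in for the real per-chunk work.

```python
import multiprocessing as mp
import os


def chunk_offsets(filename, split_size):
    """Return (start, end) byte offsets that always fall on line boundaries."""
    filesize = os.path.getsize(filename)
    offsets = []
    cursor = 0
    with open(filename, "rb") as fh:  # binary mode: offsets are plain bytes
        while cursor < filesize:
            # Jump roughly one chunk ahead, then finish the line we landed in.
            fh.seek(min(cursor + split_size, filesize))
            fh.readline()
            end = fh.tell()
            offsets.append((cursor, end))
            cursor = end
    return offsets


def count_lines(filename, start, end):
    """Stand-in for the real work: count the lines stored in bytes [start, end)."""
    with open(filename, "rb") as fh:
        fh.seek(start)
        # Reads the whole chunk into memory, so keep split_size reasonable.
        return len(fh.read(end - start).splitlines())


def process_in_parallel(filename, split_size=100 * 1024 * 1024):
    """Run count_lines on every chunk in a process pool and sum the results."""
    with mp.Pool() as pool:
        jobs = [pool.apply_async(count_lines, (filename, start, end))
                for start, end in chunk_offsets(filename, split_size)]
        return sum(job.get() for job in jobs)
```

Call `process_in_parallel(path)` from inside an `if __name__ == "__main__":` block. Only the file name and two integers are sent to each worker, so the chunks themselves never have to be pickled.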

But we got more than a 2x speedup from this on the first run without fine-tuning it. You can tune the number of processes in the pool and the chunk size to get an even higher speedup, depending on your setup. If you have multiple files, as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.

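Note: The pool setup needs to sit inside an `if __name__ == "__main__":` block. On platforms where worker processes start by re-importing the main module (Windows, for example), unguarded code would run again in every worker and try to create more processes.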

Can this be modified to operate on multiple text files simultaneously (i.e. if you need to compare files line-by-line)? — Roko Mijic, Aug 08 '17 at 10:26
Absolutely. That was what our code was using this method for. We were not doing a simple diff comparison but matching lines in the files together. Depending on the file size it might be easier to just read all the lines of each file and walk the lists at the same time to do the comparison. — max, Aug 24 '17 at 00:19
in the end I read all my files into memory, and that worked out well for me. the files are ~1GB each but the time spent processing per line is significantly greater than the time cost of reading into memory per line. — Roko Mijic, Aug 24 '17 at 11:33

score 9 · Accepted Answer · answered Dec 08 '11 at 00:50

Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
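
A minimal sketch of that benchmark, assuming Python 3. `time_csv_parse` is a made-up helper name; it runs the same read-and-parse loop as the question but does nothing with each row, so the elapsed time covers only I/O plus CSV parsing.

```python
import csv
import time


def time_csv_parse(path):
    """Return (row_count, seconds) for reading and parsing the file with no per-row work."""
    start = time.perf_counter()
    rows = 0
    with open(path, "r", newline="") as fh:
        for row in csv.reader(fh):
            rows += 1  # parse only; handleRow is deliberately not called
    return rows, time.perf_counter() - start
```

If this takes a small fraction of the 20 minutes, the time is going into your own per-row code rather than into the `csv` module.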

If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.

If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
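
A rough sketch of that reader-plus-queue layout, assuming Python 3. `process_with_queue` and its `handle_row` argument are hypothetical names. It uses threads as described here, so it only speeds things up when the per-row work releases the GIL (I/O, some C extensions); for pure-Python CPU-bound work the same layout would need worker processes instead.

```python
import csv
import queue
import threading


def process_with_queue(path, handle_row, num_workers=4):
    """Parse rows on the calling thread and hand them to worker threads via a queue."""
    rows = queue.Queue(maxsize=10000)  # bounded so the reader cannot outrun the workers
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            row = rows.get()
            if row is None:  # sentinel: no more rows
                break
            value = handle_row(row)
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    with open(path, "r", newline="") as fh:
        for row in csv.reader(fh):
            rows.put(row)
    for _ in threads:
        rows.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results  # completion order, not file order
```

The results come back in whatever order the workers finish, so carry a row index along if the original order matters.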

dkamins

That's a good suggestion. How would you go about benchmarking the .csv processing, given it's done within the `for` loop iteration? – Ron Dec 08 '11 at 01:02
Well, found this: http://stackoverflow.com/questions/2359253/solving-embarassingly-parallel-problems-using-python-multiprocessing It's close enough to this description, so I guess I'll use queues. – Ron Dec 08 '11 at 01:32
I mean e.g. run the same for loop but instead of `handleRow(row, dataStructure)` you just say `pass` – dkamins Dec 08 '11 at 06:23

score 9 · Answer 3 · answered Dec 08 '11 at 01:04

Because of the GIL, Python's threading won't speed up processor-bound computations the way it can speed up IO-bound work.

Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
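
A minimal sketch of what that could look like, assuming Python 3 and rows that can be handled independently. `handle_row` here is a hypothetical CPU-bound stand-in, not the asker's function, and `handle_rows_in_parallel` is a made-up wrapper name.

```python
import multiprocessing as mp


def handle_row(row):
    """Hypothetical CPU-bound work on one row: sum the squares of its numeric fields."""
    return sum(int(field) ** 2 for field in row)


def handle_rows_in_parallel(rows, processes=None, chunksize=1000):
    """Apply handle_row to every row using a pool of worker processes."""
    # processes=None lets the pool use one worker per available CPU.
    with mp.Pool(processes) as pool:
        # chunksize batches rows per task so inter-process overhead stays small.
        return pool.map(handle_row, rows, chunksize=chunksize)
```

Call `handle_rows_in_parallel` from inside an `if __name__ == "__main__":` block. The worker function has to live at module level so it can be pickled, and each worker gets its own copy of any data, so a shared data structure has to be merged from the returned results.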

Raymond Hettinger

score 4 · Answer 4 · answered Aug 22 '18 at 13:42

Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it yields results as soon as they are ready instead of waiting for all the results to be available. This saves a lot of memory.

(Here is an untested snippet of code which reads a CSV file row by row, processes each row, and writes the result to a different CSV file. The per-row processing is done in parallel.)

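import multiprocessing as mp
import csv

CHUNKSIZE = 10000   # Set this to whatever you feel reasonable

def process(row):
   # Placeholder for the real per-row work; must be a module-level function so it can be pickled
   return row

def _run_parallel(csvfname, csvoutfname):
   with open(csvfname, newline='') as csvf, \
        open(csvoutfname, 'w', newline='') as csvout, \
        mp.Pool() as p:
       reader = csv.reader(csvf)
       writer = csv.writer(csvout)
       # imap yields results in input order as they become available
       writer.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))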

score 4 · Answer 5 · answered Dec 08 '11 at 01:15

If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run one instance of the process per input file. These instances, since they are completely separate processes, will not be bound by GIL problems.
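
A minimal sketch of the splitting step, assuming Python 3, a file without a header row, and rows whose order does not matter. `split_csv` and the `part_N.csv` file names are made up for illustration.

```python
import csv
import os


def split_csv(path, num_parts, out_dir):
    """Distribute the rows of path round-robin over num_parts new CSV files."""
    part_paths = [os.path.join(out_dir, f"part_{i}.csv") for i in range(num_parts)]
    files = [open(p, "w", newline="") for p in part_paths]
    try:
        writers = [csv.writer(f) for f in files]
        with open(path, "r", newline="") as src:
            # Stream row by row so the input never has to fit in memory.
            for i, row in enumerate(csv.reader(src)):
                writers[i % num_parts].writerow(row)
    finally:
        for f in files:
            f.close()
    return part_paths
```

Something like `split_csv("file", os.cpu_count(), "parts")` would give one part per CPU; each part can then be fed to a separate run of the analysis script.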

score 1 · Answer 6 · answered Jul 30 '12 at 22:37

If you use zmq and a DEALER middleman, you'd be able to spread the row processing not just across the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarantee that you hit an I/O limit rather than a CPU limit :)

g19fanatic
