19

I'm trying to a parallelize an application using multiprocessing which takes in a very large csv file (64MB to 500MB), does some work line by line, and then outputs a small, fixed size file.

Currently I do a list(file_obj), which unfortunately is loaded entirely into memory (I think) and I then I break that list up into n parts, n being the number of processes I want to run. I then do a pool.map() on the broken up lists.

This seems to have a really, really bad runtime in comparison to a single threaded, just-open-the-file-and-iterate-over-it methodology. Can someone suggest a better solution?

Additionally, I need to process the rows of the file in groups which preserve the value of a certain column. These groups of rows can themselves be split up, but no group should contain more than one value for this column.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
user1040625
  • 403
  • 1
  • 4
  • 11

2 Answers2

20

list(file_obj) can require a lot of memory when fileobj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.

In particular, we can use

reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)

to split the file into processable chunks, and

groups = [list(chunk) for key, chunk in itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)

to have the multiprocessing pool work on num_chunks chunks at a time.

By doing so, we need roughly only enough memory to hold a few (num_chunks) chunks in memory, instead of the whole file.


import multiprocessing as mp
import itertools
import time
import csv

def worker(chunk):
    # `chunk` will be a list of CSV rows all with the same name column
    # replace this with your real computation
    # print(chunk)
    return len(chunk)  

def keyfunc(row):
    # `row` is one row of the CSV file.
    # replace this with the name column.
    return row[0]

def main():
    pool = mp.Pool()
    largefile = 'test.dat'
    num_chunks = 10
    results = []
    with open(largefile) as f:
        reader = csv.reader(f)
        chunks = itertools.groupby(reader, keyfunc)
        while True:
            # make a list of num_chunks chunks
            groups = [list(chunk) for key, chunk in
                      itertools.islice(chunks, num_chunks)]
            if groups:
                result = pool.map(worker, groups)
                results.extend(result)
            else:
                break
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • I lied when I said the lines aren't interrelated--in the csv, there is a column which needs to be split by (a name column, and all rows with that name can't be split up). However, I think I can adapt this to group on this criteria. Thanks! I knew nothing about itertools, and now I little more than nothing. – user1040625 Jan 03 '12 at 19:25
  • There was an error in my original code. All calls to `pool.apply_async` are non-blocking, so the entire file was being queued up at once. This would have resulted in no memory saving. So I've changed the loop a bit to queue up `num_chunks` at a time. The call to `pool.map` is blocking, which will prevent the entire file from being queued up at once. – unutbu Jan 03 '12 at 21:03
  • @HappyLeapSecond a user is trying to implement your methods here http://stackoverflow.com/questions/31164731/python-chunking-csv-file-multiproccessing and is having trouble. Perhaps you can help? – m0meni Jul 01 '15 at 17:07
2

I would keep it simple. Have a single program open the file and read it line by line. You can choose how many files to split it into, open that many output files, and every line write to the next file. This will split the file into n equal parts. You can then run a Python program against each of the files in parallel.

Joe
  • 46,419
  • 33
  • 155
  • 245