
I have a large dataset containing strings. I just want to open it via read_fwf using widths, like this:

widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)

This would help me mark the data, but the system crashes (it works with nrows=20000). So I decided to process it in chunks (e.g. 20000 rows), like this:

cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # <some code using chunk>

My question is: what should I do inside the loop to merge (concatenate?) the chunks back into a .csv file after processing each chunk (marking rows, dropping or modifying columns)? Or is there another way?

Chernyavski.aa

1 Answer

I'm going to assume that since reading the entire file

tp = pandas.read_fwf(file, widths=widths, header=None)

fails but reading in chunks works, the file is too big to be read at once and you encountered a MemoryError.

In that case, if you can process the data in chunks, then to combine the results into a single CSV, you could use chunk.to_csv to write the CSV in chunks:

import pandas as pd

filename = ...
cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')

Note that mode='a' opens the file in append mode, so that the output of each chunk.to_csv call is appended to the same file.
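If you also want to avoid repeating the header line and writing the DataFrame index on every append, a minimal sketch of the whole loop could look like the following. The input path, the example widths, the output path, and the drop(columns=[0]) processing step are just placeholders, not part of the original question:

import pandas as pd

file = 'data.txt'          # placeholder path to the fixed-width input file
widths = [3, 7, 9, 7]      # placeholder widths; use your real list
cs = 20000
outfile = 'processed.csv'  # placeholder path for the output CSV

first = True
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # placeholder processing: drop the first column as an example
    processed = chunk.drop(columns=[0])
    # write the header only for the first chunk, then append without it
    processed.to_csv(outfile, mode='w' if first else 'a',
                     header=first, index=False)
    first = False

Passing header=first and index=False keeps the output from repeating the integer column names (0, 1, 2, ...) for every chunk and from adding an extra index column.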

unutbu