
Say I have effectively unlimited hard drive space storing a CSV far too large to fit in memory, but only 4 GB of RAM.

Reading the file into pandas is no problem using:

import pandas

reader = pandas.read_csv('./tools/OCHIN_forgeo.csv', chunksize=10000)
for i, r in enumerate(reader):
    result_df = analyze_chunk(r)  # analyze_chunk is my per-chunk processing function
    result_df.to_csv('chunk_{}.csv'.format(i))

If I now want to reassemble the chunks into a full result, the following would not work, because pandas.concat builds the entire result in memory at once:

import glob
import pandas

files = glob.glob('chunk_*.csv')
master_df = pandas.concat(pandas.read_csv(f, index_col=False) for f in files)
master_df.to_csv('master_df_output.csv', index=False)

How can I iteratively read the chunks and write them back out to a single CSV without running out of RAM?
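
One way to do this without holding the full result in memory (a minimal sketch, assuming each chunk_*.csv file fits in RAM on its own and that they all share the same columns) is to read one chunk file at a time and append it to a single output CSV, writing the header only for the first file:

import glob
import pandas

for i, f in enumerate(sorted(glob.glob('chunk_*.csv'))):
    df = pandas.read_csv(f, index_col=False)
    # the first file opens the output fresh and keeps its header;
    # every later file is appended with the header suppressed
    df.to_csv('master_df_output.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)

Note that sorted() compares filenames lexicographically, so chunk_10.csv sorts before chunk_2.csv; if the original row order matters, zero-pad the chunk number when writing the chunks or sort the file list numerically.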

  • You can append to an open file, as this question shows http://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file – mbatchkarov Dec 10 '14 at 22:56
  • Awesome, do you know if df.to_csv('my_csv.csv', mode='a', header=False) will respect differences in columns? Say one CSV is missing a column (see the sketch after these comments) – bcollins Dec 10 '14 at 23:09
  • 1
  • I don't know off the top of my head. Come to think of it, you don't have to load the files into Python. You could just `cat` them all together (assuming only the first one has a header) – mbatchkarov Dec 10 '14 at 23:21
  • Good point. Hey, thanks for your help. – bcollins Dec 11 '14 at 05:04
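
On the column-mismatch question above: to_csv(mode='a', header=False) does not align columns by name, it simply writes each DataFrame in its current column order, so a chunk with missing columns would produce misaligned rows. A minimal sketch that guards against this (master_cols is an assumed, user-defined list of the expected columns) is the same loop as before with one added reindex step:

import glob
import pandas

# assumed, user-defined list of the columns every chunk should end up with
master_cols = ['col_a', 'col_b', 'col_c']

for i, f in enumerate(sorted(glob.glob('chunk_*.csv'))):
    df = pandas.read_csv(f, index_col=False)
    # reindex adds any missing columns (filled with NaN), drops extras,
    # and enforces a consistent column order before appending
    df = df.reindex(columns=master_cols)
    df.to_csv('master_df_output.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)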
