
Say I have effectively unlimited hard drive space storing a CSV far too large to fit in memory, but only 4 GB of RAM.

Reading the file into pandas is no problem using:

import pandas

reader = pandas.read_csv('./tools/OCHIN_forgeo.csv', chunksize=10000)
for i, r in enumerate(reader):
    result_df = analyze_chunk(r)  # analyze_chunk is my per-chunk processing function
    result_df.to_csv('chunk_{}.csv'.format(i))

If I now want to reassemble the chunks into a full result, the following would not work, because pandas.concat builds the entire result in memory at once:

import glob
import pandas

files = glob.glob('chunk_*.csv')
master_df = pandas.concat(pandas.read_csv(f, index_col=False) for f in files)
master_df.to_csv('master_df_output.csv', index=False)

How can I iteratively read the chunks and write them back out to a single CSV without running out of RAM?
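
One way to do this without holding the full result in memory (a minimal sketch, assuming each chunk_*.csv file fits in RAM on its own and that they all share the same columns) is to read one chunk file at a time and append it to a single output CSV, writing the header only for the first file:

import glob
import pandas

for i, f in enumerate(sorted(glob.glob('chunk_*.csv'))):
    df = pandas.read_csv(f, index_col=False)
    # the first file opens the output fresh and keeps its header;
    # every later file is appended with the header suppressed
    df.to_csv('master_df_output.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)

Note that sorted() compares filenames lexicographically, so chunk_10.csv sorts before chunk_2.csv; if the original row order matters, zero-pad the chunk number when writing the chunks or sort the file list numerically.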

  • You can append to an open file, as this question shows http://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file – mbatchkarov Dec 10 '14 at 22:56
  • Awesome, do you know if df.to_csv('my_csv.csv', mode='a', header=False) will respect differences in columns? Say one CSV is missing a column (see the sketch after these comments) – bcollins Dec 10 '14 at 23:09
  • 1
  • I don't know off the top of my head. Come to think of it, you don't have to load the files into Python. You could just `cat` them all together (assuming only the first one has a header) – mbatchkarov Dec 10 '14 at 23:21
  • Good point. Hey, thanks for your help. – bcollins Dec 11 '14 at 05:04
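
On the column-mismatch question above: to_csv(mode='a', header=False) does not align columns by name, it simply writes each DataFrame in its current column order, so a chunk with missing columns would produce misaligned rows. A minimal sketch that guards against this (master_cols is an assumed, user-defined list of the expected columns) is the same loop as before with one added reindex step:

import glob
import pandas

# assumed, user-defined list of the columns every chunk should end up with
master_cols = ['col_a', 'col_b', 'col_c']

for i, f in enumerate(sorted(glob.glob('chunk_*.csv'))):
    df = pandas.read_csv(f, index_col=False)
    # reindex adds any missing columns (filled with NaN), drops extras,
    # and enforces a consistent column order before appending
    df = df.reindex(columns=master_cols)
    df.to_csv('master_df_output.csv',
              mode='w' if i == 0 else 'a',
              header=(i == 0),
              index=False)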
