Say I have an infinitely large hard drive storing an infinitely large CSV file, but only 4GB of RAM.
Reading the file into pandas is no problem using:
import pandas
# Read the CSV in 10,000-row chunks so only one chunk is in memory at a time.
reader = pandas.read_csv('./tools/OCHIN_forgeo.csv', chunksize=10000)
for i, r in enumerate(reader):
    result_df = analyze_chunk(r)  # analyze_chunk is my own per-chunk processing function
    result_df.to_csv('chunk_{}.csv'.format(i))
If I now want to reassemble the chunks into a full result, the following would not work, because pandas.concat has to build the entire DataFrame in memory before writing it out:
import glob
files = glob.glob('chunk_*.csv')
# the generator still gets materialized into one giant in-memory DataFrame
master_df = pandas.concat(pandas.read_csv(f, index_col=False) for f in files)
master_df.to_csv('master_df_output.csv', index=False)
How can I iteratively read the chunks and output them to disk without running out of RAM?
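For reference, the direction I was considering is a rough sketch along these lines (assuming all the chunk files share the same columns, and that appending with to_csv's mode='a' is acceptable): read one chunk file at a time and append its rows to the output, writing the header only for the first file, so at most one chunk is ever in memory.
import glob
import pandas

files = glob.glob('chunk_*.csv')
for i, f in enumerate(files):
    chunk = pandas.read_csv(f, index_col=False)  # only one chunk held in memory
    chunk.to_csv('master_df_output.csv',
                 mode='w' if i == 0 else 'a',    # overwrite on the first file, append afterwards
                 header=(i == 0),                # write the header row only once
                 index=False)
Is this append-as-you-go pattern the right way to do it, or does pandas offer something better suited?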