I only have about 4 GB of memory to work with. The files look like this:
| File    | Rows        | Columns | Header        |
|---------|-------------|---------|---------------|
| 1st csv | 2,000,000+  | 3       | id1, id2, ... |
| 2nd csv | 10,000,000+ | 24      | id2, ...      |
| 3rd csv | 170         | 5       | id1, ...      |
What I want to do is:
import pandas as pd
file1 = pd.read_csv('data1.csv')
file2 = pd.read_csv('data2.csv')
file3 = pd.read_csv('data3.csv')
# join the small file3 onto file1 by id1, then join file2 by id2
data = pd.merge(file1, file3, on='id1', how='left')
data = pd.merge(data, file2, on='id2', how='left')
# write the merged data to merge.csv
data.to_csv('merge.csv', index=False)
But there is not enough memory for this, so I have tried two ways. The first way is:
# data1 and data2 are chunk iterators from pd.read_csv(..., chunksize=...)
data_merge = pd.DataFrame()
for data1_chunk in data1:
    for data2_chunk in data2:
        data = pd.merge(data1_chunk, data2_chunk, on='id2')
        data_merge = pd.concat([data_merge, data])
The second way is:
for data1_chunk, data2_chunk in zip(data1, data2):
    data_merge = pd.merge(data1_chunk, data2_chunk, on='id2', how='left')
But they do not work.
Is there any way to use the chunksize parameter to deal with big csv files like this? Or is there a better or easier way?
The question "How to read a 6 GB csv file with pandas" only explains how to handle one big csv file, not two or more. I want to know how to use the 'iterator' across two or more files with limited memory.
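For example, would something like the sketch below be the right direction? It is only a rough idea, and it assumes that id2 is the only column my data shares with data2 and that a chunksize of 1,000,000 rows fits in memory (the chunksize value and the variable names are just placeholders). The plan: merge the two smaller files normally, then read data2 with chunksize, inner-join each chunk, append the result to merge.csv, and finally append the rows that never matched so the output is still equivalent to a left join.
import pandas as pd
# Merge the two smaller files in memory first; together they are far
# smaller than data2 (file1: ~2M rows x 3 cols, file3: 170 rows x 5 cols).
file1 = pd.read_csv('data1.csv')
file3 = pd.read_csv('data3.csv')
small = pd.merge(file1, file3, on='id1', how='left')
# Column layout of the final result: small's columns, then data2's columns
# (minus the shared key), matching what pd.merge would produce.
data2_cols = pd.read_csv('data2.csv', nrows=0).columns
out_cols = list(small.columns) + [c for c in data2_cols if c not in small.columns]
matched = set()  # id2 values that found at least one partner in data2
first = True     # write the CSV header only on the first write
# Read data2 in chunks; inner-join each chunk against the small frame and
# append to merge.csv, so the full 10M-row result never has to fit in memory.
for chunk in pd.read_csv('data2.csv', chunksize=1_000_000):
    part = pd.merge(small, chunk, on='id2', how='inner').reindex(columns=out_cols)
    part.to_csv('merge.csv', mode='w' if first else 'a', header=first, index=False)
    matched.update(part['id2'].unique())
    first = False
# Append the rows of small that never matched, NaN-filled for data2's columns,
# so the result is equivalent to a left join.
leftover = small[~small['id2'].isin(matched)].reindex(columns=out_cols)
leftover.to_csv('merge.csv', mode='a', header=first, index=False)
The idea is that only small plus one chunk of data2 has to be in memory at any time, but I am not sure this is correct or whether there is a simpler way.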