
I only have 4 GB of memory to use. The info about the files is as follows:

| File    | Number of rows   | Number of columns | Header names  |
|---------|------------------|-------------------|---------------|
| 1st CSV | 2,000,000+ rows  | 3                 | id1, id2, ... |
| 2nd CSV | 10,000,000+ rows | 24                | id2, ...      |
| 3rd CSV | 170 rows         | 5                 | id1, ...      |


What I want to do is:

import pandas as pd

file1 = pd.read_csv('data1.csv')
file2 = pd.read_csv('data2.csv')
file3 = pd.read_csv('data3.csv')
data = pd.merge(file1, file3, on='id1', how='left')
data = pd.merge(data, file2, on='id2', how='left')
# write data to a csv file: merge.csv

But the memory is not enough. I have tried two ways. The first way is:

# data1 and data2 are chunked readers, e.g. pd.read_csv('data1.csv', chunksize=100000)
data_merge = pd.DataFrame()
for data1_chunk in data1:
    for data2_chunk in data2:
        data = pd.merge(data1_chunk, data2_chunk, on='id2')
        data_merge = pd.concat([data_merge, data])

The second way is:

# data1 and data2 are the same chunked readers as above
for data1_chunk, data2_chunk in zip(data1, data2):
    data_merge = pd.merge(data1_chunk, data2_chunk, on='id2', how='left')

But they do not work.

Is there any way to use the chunksize parameter to deal with big CSV files? Or is there another better or easier way?

The question [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) only tells how to deal with one big CSV file, not two or more. I want to know how to use the iterator over two or more files with limited memory.
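
For context, the single-file pattern from that question looks roughly like this (a sketch only; the chunk size and the per-chunk processing are placeholders):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
reader = pd.read_csv('data2.csv', chunksize=100000)
for chunk in reader:
    process(chunk)  # hypothetical per-chunk processing

What I do not see is how to combine such iterators for two or more files in a merge.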

Kiwi Qi
  • Possible duplicate of [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) – Nyerguds Apr 23 '18 at 07:35
  • That question only tells how to deal with one big file, not two or more. The key is how to do the "iteration" over two or more files when each is read only a part at a time via the chunksize parameter of pandas.read_csv. – Kiwi Qi Apr 23 '18 at 08:15
  • Have a look at [Dask](https://dask.pydata.org/en/latest/) (a rough sketch follows after these comments). – Little Bobby Tables Apr 23 '18 at 08:17
  • In order to _merge_ data, it seems logical you need to have all that data in memory... – Nyerguds Apr 23 '18 at 08:18
  • For example, if the left csv file only has one row but the right file has 1 billion rows, then to merge left and right you just load one or 1000 rows into memory and iterate over the right file. – Kiwi Qi Apr 23 '18 at 09:30
  • Have you considered using a database? – Bruce Martin Apr 23 '18 at 11:18
  • It is a good idea; I am trying MySQL now, but I still want to know how to do this using Python and pandas. ╮(╯▽╰)╭ – Kiwi Qi Apr 23 '18 at 12:20
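
A rough sketch of the Dask suggestion from the comments, assuming the column names from the table above (dask.dataframe mirrors the pandas API but reads the CSVs lazily in partitions, so the whole files never have to sit in memory at once; this is a sketch, not a tested solution for this data size):

import dask.dataframe as dd

# lazy, partitioned readers: nothing is loaded until the result is computed or written
file1 = dd.read_csv('data1.csv')
file2 = dd.read_csv('data2.csv')
file3 = dd.read_csv('data3.csv')

data = dd.merge(file1, file3, on='id1', how='left')
data = dd.merge(data, file2, on='id2', how='left')

# writes one output file per partition (merge-0.csv, merge-1.csv, ...)
data.to_csv('merge-*.csv')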

1 Answer


I find that the following code may work, logically:

import pandas as pd

file1 = pd.read_csv('data1.csv', chunksize=100, iterator=True)
temp = None
for chunk1 in file1:
    # re-open file2 so its chunk iterator starts from the beginning for every chunk1
    file2 = pd.read_csv('data2.csv', chunksize=100, iterator=True)
    for chunk2 in file2:
        temp_chunk = pd.merge(chunk1, chunk2, on='id2', how='inner')
        temp = temp_chunk if temp is None else pd.concat([temp, temp_chunk])
finalData = temp.drop_duplicates(keep='first')
# process finalData ...

It takes more time but less memory, I think.
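
A minimal sketch of a variation that keeps memory roughly flat, assuming file1 and file3 fit comfortably in memory (at 2,000,000+ rows x 3 columns and 170 rows x 5 columns they likely do) and that appending each merged chunk straight to merge.csv is acceptable. Note that the per-chunk merge below keeps only the matched (inner) rows; keeping unmatched rows of the left table, as a true left join would, needs extra bookkeeping:

import pandas as pd

# small tables: load them whole and pre-merge
file1 = pd.read_csv('data1.csv')
file3 = pd.read_csv('data3.csv')
left = pd.merge(file1, file3, on='id1', how='left')

# big table: stream it once and append each merged chunk to the output file
first = True
for chunk2 in pd.read_csv('data2.csv', chunksize=100000):
    merged = pd.merge(left, chunk2, on='id2', how='inner')
    merged.to_csv('merge.csv', mode='w' if first else 'a', header=first, index=False)
    first = False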

Kiwi Qi