
I am trying to merge two large dataframes.

One dataframe (patent_id) has 5,271,459 rows, and the other has more than 10,000 columns.

To combine these two big dataframes, I use merge and split the right dataframe into chunks (similar to MemoryError with python/pandas and large left outer joins).

But it still hits a MemoryError. Is there any room for improvement?

Should I use concat rather than merge? (A sketch of what I mean is below, after my code.)

Or should I use the csv module rather than pandas to handle this, as in MemoryError with python/pandas and large left outer joins?

import pandas as pd

for key in column_name:
    print(key)
    newname = '{}_post.csv'.format(key)
    patent_rotated_chunks = pd.read_csv(newname, iterator=True, chunksize=10000)

    temp = patent_id.copy(deep=True)

    # Merge each chunk of the right dataframe into the running result.
    for patent_rotated in patent_rotated_chunks:
        temp = pd.merge(temp, patent_rotated, on=["patent_id_0"], how='left')

    temp.to_csv('{}_sorted.csv'.format(key))

    del temp
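
For reference, here is roughly what I mean by the concat variant. This is only a minimal sketch, assuming patent_id_0 is unique so it can serve as the index; I realize that concatenating the chunks rebuilds the full right-hand frame in memory:

import pandas as pd

# Sketch of the concat/join idea: align on a shared index instead of repeated merges.
temp = patent_id.set_index("patent_id_0")

for key in column_name:
    newname = '{}_post.csv'.format(key)
    pieces = [chunk.set_index("patent_id_0")
              for chunk in pd.read_csv(newname, chunksize=10000)]
    right = pd.concat(pieces)  # stack the chunks back into one frame
    temp.join(right, how='left').to_csv('{}_sorted.csv'.format(key))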
    Pandas uses tons of memory. Don't use it unless you have a lot of memory to spend. See this: http://stackoverflow.com/questions/34783461/need-to-merge-2-large-csv-files-row-by-row-in-python – Javier Aug 22 '16 at 14:12

1 Answer


The approach below works for me; it is adapted from MemoryError with python/pandas and large left outer joins.

import csv

def gen_chunks(reader, chunksize=1000000):
    """Yield successive lists of up to chunksize rows from a csv.reader."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # start a fresh list rather than mutating the yielded one
        chunk.append(line)
    yield chunk

for key in column_name:
    newcsv = '{}_post.csv'.format(key)

    with open("patent_id.csv", newline='') as idata, open(newcsv, newline='') as odata:
        leftdata = csv.reader(idata)
        next(leftdata)  # skip the header of the left file

        rightdata = csv.reader(odata)
        header = next(rightdata)
        index = header.index("patent_id_0")  # position of the join key in the right file
        columns = ["project_id"] + header

        # Load the whole right-hand file into a dict keyed by patent_id_0.
        rd = {row[index]: row for row in rightdata}

        first_key = next(iter(rd))
        print(first_key)      # spot-check one key
        print(rd[first_key])  # and its row

        with open('{}_sorted.csv'.format(key), "w", newline='') as csvfile:
            output = csv.writer(csvfile)
            output.writerow(columns)  # writerow, not writerows, for a single header row

            for chunk in gen_chunks(leftdata):
                print(key, "New Chunk!")
                # Left join: pad unmatched keys so every output row has the same width.
                ld = [[pid[1]] + rd.get(pid[1], ["NaN"] * len(header)) for pid in chunk]
                output.writerows(ld)
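
Note that this still loads the entire right-hand file into the rd dict, so the right file has to fit in memory; what it avoids is the intermediate copies pandas makes on every merge. As a quick sanity check of the chunker on its own (reusing the left file from above):

import csv

# Count rows per chunk to confirm gen_chunks splits the stream as expected.
with open("patent_id.csv", newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for n, chunk in enumerate(gen_chunks(reader, chunksize=1000000)):
        print(n, len(chunk))  # every chunk except the last has chunksize rows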