
I'm fairly new to both Python and Pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to throw an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.

Here's a model of the code I've put together so far:

import pandas as pd

leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)

rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")

This works with small files but raises a MemoryError on my system with files two orders of magnitude smaller than the ones I actually need to merge.

I've been browsing through related questions (one, two, three) but none of the answers really get at this basic problem - or if they do, it's not explained well enough for me to recognize the potential solution. And the accepted answers are no help. I'm already on a 64-bit system and using the most recent stable version of Canopy (1.5.5, 64-bit, with Python 2.7.10).

What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?

– James T
  • The most pythonic way is to get more memory. No seriously. Pandas always keeps your data in memory. If your files are too big, you simply cannot do it with pandas. – cel Sep 17 '15 at 16:29
  • If a system with 8 GB RAM isn't enough to merge two files each 1/100th the size of the files I need, how much RAM would be enough? 1TB? How many supercomputers would I need to assemble? There must be a different way to proceed that uses less memory. If not pandas, how? – James T Sep 17 '15 at 16:35
  • AFAIK, all join/merge operations can be done by sort-merges. So it's not necessary to keep things in memory. – cel Sep 17 '15 at 16:38
  • And incidentally, the smaller test files are only 14 & 110 MB in size. No idea why 8 GB of system RAM isn't enough to merge those, even with Windows 7. The full files are 144 MB & 1.1 GB. The final file size shouldn't be more than 2 GB. – James T Sep 17 '15 at 16:43
  • Not incredibly familiar with sort-merges. Assuming you mean this: http://interactivepython.org/runestone/static/pythonds/SortSearch/TheMergeSort.html. I can see how that's a great way to make a long list alpha-sorted. But I don't need to do that. If that method can be used to conduct a left join between two datasets I'd need some more explanation. – James T Sep 17 '15 at 16:56
  • You probably have repeated values in the merge 'keys', which results in a Cartesian product of such rows in the output – user1827356 Sep 17 '15 at 17:27
  • In the left dataset there are repeated keys. Just checked and there aren't any in the right dataset - the keys there are unique. Since it's a left join and not a full join, the number of rows in the output should be the same as the left dataset. Right? – James T Sep 17 '15 at 17:46
  • It should be easy to test. Just take 25% of your left set and see how many lines are output. I would be shocked if it's a memory problem – user1827356 Sep 17 '15 at 18:56
  • I've already used 10% of the left set, both with the full right set and 10% of the right. MemoryError either way. I can't explain it either. – James T Sep 18 '15 at 12:54
  • When you say "the accepted answers are no help," do you mean to include the `pd.concat` answer as well? Try setting `index_col='key'` in your call to `read_csv` and then concatenating (or joining, if that complains about duplicate indices)? – James Sep 22 '15 at 16:17
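
For reference, the join variant of that suggestion would look roughly like this (untested; file and column names taken from the question's model code):

import pandas as pd

leftcols = ['a','b','c','d','e','key']
rightcols = ['x','y','z','key']

# read both files with 'key' as the index so they can be aligned on it
leftdata = pd.read_csv("LEFT.csv", names=leftcols, index_col='key')
rightdata = pd.read_csv("RIGHT.csv", names=rightcols, index_col='key')

# an index-on-index left join, roughly the same as merge(..., on='key', how='left')
mergedata = leftdata.join(rightdata, how='left')
mergedata.to_csv("FINAL.csv")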

3 Answers


Why not just read your right file into pandas (or even into a simple dictionary), then loop through your left file using the csv module to read, extend, and write each row? Is processing time a significant constraint (vs your development time)?
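
Roughly, something like this (an untested sketch, assuming the key is the last column of each file, as in your model code):

import csv

# build one lookup from the right file: key -> its three non-key columns (x, y, z)
with open("RIGHT.csv", "rb") as f:                 # Python 2; on Python 3 use open(..., newline='')
    lookup = {row[3]: row[0:3] for row in csv.reader(f)}

# stream the left file row by row, extend each row, and write it straight out
with open("LEFT.csv", "rb") as fin, open("FINAL.csv", "wb") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        # 'key' is the last of the six left columns in the question's model
        writer.writerow(row + lookup.get(row[5], ["NaN", "NaN", "NaN"]))

Only the right file's ~160K rows are held in memory; the 11-million-row left file never is.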

– Jonathan March
  • Processing time is not incredibly significant; I'll try it out. If I come up with a workable solution I'll accept the answer and post the code below. Thank you! – James T Sep 17 '15 at 17:37
  • Thank you Jonathan, this approach worked - see below. I think I'll mark the comment below as the answer since it'll provide browsing users who have similar issues a model of the code to work with. But I'd give you a second upvote if I could. Thanks! – James T Sep 23 '15 at 12:55

This approach ended up working. Here's a model of my code:

import csv

# Python 2.7: universal-newline mode for reading, binary mode for csv writing
idata = open("KEY_ABC.csv", "rU")   # left file (~11 million rows)
odata = open("KEY_XYZ.csv", "rU")   # right file (~160K rows, key in column index 3)

leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

def gen_chunks(reader, chunksize=1000000):
    """Yield the rows of a csv.reader in lists of at most `chunksize` rows.
    Each chunk must be fully consumed before the next one is requested,
    because the same list object is cleared and reused."""
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# Build one lookup dictionary per non-key column of the right file,
# rewinding the file between passes.
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

# Process the left file one chunk at a time: append the three right-hand
# columns to each row (the left file's key is in column index 6), filling
# with "NaN" where a key has no match, then write the chunk to its own file.
for chunk in gen_chunks(leftdata):
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
                d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
                d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
                d3.get(k[6], "NaN")] for k in res1]
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)

By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and appending those columns to each left row (filling them from the dictionaries via the key match), the script completed the whole left join in about four minutes with no memory issues.

Thanks also to user miku, who provided the chunk generator code in a comment on this post.

That said: I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.

– James T

As suggested in another question, "Large data" work flows using pandas, dask (http://dask.pydata.org) could be an easy option.

Simple example

import dask.dataframe as dd
df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')
df_merge = dd.merge(df1, df2, how='left')
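
Note that dask evaluates lazily: the merge above only builds a task graph, and nothing is read or joined until you ask for output. A rough end-to-end sketch (the file names are placeholders, and the shared 'key' column is assumed from the question):

import dask.dataframe as dd

df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')

# lazy: this only records the work to be done
df_merge = dd.merge(df1, df2, on='key', how='left')

# writing the result triggers the actual computation, one CSV per partition
df_merge.to_csv('FINAL-*.csv')
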
– MicPie