10

I'm using pandas to do an outer merge on a set of about 1000-2000 CSV files. Each CSV file has an identifier column, id, which is shared across all the files, but each file has its own unique set of 3-5 columns. There are roughly 20,000 unique id rows in each file. All I want to do is merge the files together, bringing all the new columns in and using the id column as the merge key.

I do this with a simple merge call in a loop:

merged_df = first_df  # dataframe loaded from the first CSV file
for next_filename in filenames:
    # load the next CSV file into next_df
    # ...
    merged_df = merged_df.merge(next_df, on=["id"], how="outer")

The problem is that with nearly 2000 CSV files, pandas throws a MemoryError in the merge operation. Is this a limitation of the merge operation itself?

The final dataframe would have 20,000 rows and roughly (2000 x 3) = 6000 columns. This is large, but not large enough to consume all the memory on the machine I am using, which has over 20 GB of RAM. Is this size too much for pandas to handle? Should I be using something like sqlite instead? Is there something I can change in the merge operation to make it work at this scale?

Thanks.

3 Answers

8

I think you'll get better performance using a concat (which acts like an outer join):

dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)

This way you do a single concatenation rather than a separate merge for every file.
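For completeness, a minimal self-contained sketch of that approach (the glob pattern and directory are assumptions; point it at wherever your CSV files actually live):

import glob

import pandas as pd

# Assumed location/pattern for the CSV files.
filenames = glob.glob('data/*.csv')

# Read each file and index it by id; a single concat then lines everything
# up on that shared index (outer-join semantics along the index by default).
dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)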

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 1
    As for memory, you should be able to use a generator expression instead of a list comprehension... (not sure about the inner workings of `concat` though) – root Jun 19 '13 at 19:09
  • @root well, generator can only be better I think (worst case it just converts it to a list) :) – Andy Hayden Jun 19 '13 at 19:12
  • @root Good spot btw! (tbh I didn't know concat would accept a generator!) – Andy Hayden Jun 19 '13 at 19:20
  • Well, I actually checked it before commenting :P; it isn't mentioned in the docs I think... – root Jun 19 '13 at 19:23
  • @AndyHayden: brilliant solution, works, thank you! Could you explain why it works better though? Is it that each individual merge takes up memory that is not garbage collected in time, so you run out of memory? – user248237dfsf Jun 19 '13 at 19:28
  • 1
    @user248237dfsf well, the main thing is that you are doing many more merge operations, whereas concat builds only *one* DataFrame (again, this is an expensive operation). I wonder if there is a leakage bug; I don't see why it should run out of memory... (I do see why it would be much slower though). – Andy Hayden Jun 19 '13 at 19:41
  • 1
    @user248237dfsf it might be worth posting that aspect as an [issue](https://github.com/pydata/pandas/issues)... – Andy Hayden Jun 19 '13 at 19:45
0

I hit the same error in 32-bit Python when using read_csv with a 1 GB file. Try the 64-bit version; hopefully that will solve the MemoryError problem.
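If you are not sure which build you are running, here is a quick standard-library check (a small sketch, nothing pandas-specific):

import struct
import sys

print(struct.calcsize('P') * 8)   # pointer size in bits: 32 on a 32-bit build, 64 on a 64-bit build
print(sys.maxsize > 2**32)        # True only on a 64-bit build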

Eric Wang
  • 1,009
  • 1
  • 9
  • 16
0

pd.concat seems to run out of memory for large dataframes as well; one option is to convert the dataframes to NumPy arrays and concatenate those:

from copy import deepcopy

import numpy as np
import pandas as pd


def concat_df_by_np(df1, df2):
    """
    Accepts two dataframes, converts each to a NumPy array, concatenates them
    horizontally and reuses the index of the first dataframe. This is not a
    concat by index but simply by position, therefore the index of both
    dataframes should be the same.
    """
    if (df1.index != df2.index).any():
        # logging.warning('Indices in concat_df_by_np are not the same')
        print('Indices in concat_df_by_np are not the same')

    dfout = deepcopy(pd.DataFrame(np.concatenate((df1.values, df2.values), axis=1),
                                  index=df1.index,
                                  columns=np.concatenate([df1.columns, df2.columns])))
    return dfout

However, one needs to be careful: this function is not a join but a horizontal append by position, where the index of the second dataframe is ignored.
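For illustration, a small usage sketch (the frame names and values here are hypothetical); because the function works pairwise, a list of aligned dataframes can be folded through it:

from functools import reduce

import pandas as pd

# Two toy frames that share the same index.
idx = pd.Index([1, 2, 3], name='id')
df_a = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=idx)
df_b = pd.DataFrame({'y': [4.0, 5.0, 6.0]}, index=idx)

combined = concat_df_by_np(df_a, df_b)              # columns x and y, index taken from df_a
# combined = reduce(concat_df_by_np, list_of_dfs)   # fold over many aligned frames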

horseshoe
  • 1,437
  • 14
  • 42