When trying to merge two dataframes using pandas, I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8GB of RAM (is this limit set by my RAM alone, or is it built into pandas?).

I know that once I have the merged table I will calculate a new column and then filter the rows, looking for the maximum values within groups. Therefore the final output table will be only 2.5 million rows.

How can I break this problem up so that I can execute this merge method on smaller parts and build up the output table, without hitting my RAM limitations?

The method below works correctly on the small example data, but fails on the larger, real data:

import pandas as pd
import numpy as np

# Create input tables
t1 = {'scenario':[0,0,1,1],
      'letter':['a','b']*2,
      'number1':[10,50,20,30]}

t2 = {'letter':['a','a','b','b'],
      'number2':[2,5,4,7]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge the two, create the new column. This causes "...array is too big."
table3 = pd.merge(table1,table2,on='letter')
table3['calc'] = table3['number1']*table3['number2']

# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario','letter'])['calc'].idxmax()]

This is a follow-up to two previous questions:

Does iterrows have performance issues?

What is a good way to avoid using iterrows in this example?

I answer my own question below.


1 Answer


You can break up the first table using groupby (for instance, on 'scenario'). It can make sense to first create a new column that splits the table into groups of exactly the size you want (see the sketch below). Then iterate over those groups, and for each one: merge it with the second table, calculate and filter, and append the small result to your final output table.
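
For example, here is a minimal sketch of building such a grouping column, assuming the 'scenario' values are integers as in the example above; scenarios_per_chunk and the 'chunk' column name are only illustrative and not part of the original code:

import numpy as np

# Illustrative sketch: scenarios_per_chunk and 'chunk' are assumptions, not from the question
scenarios_per_chunk = 2                                      # pick the largest value that still fits in memory
table1['chunk'] = table1['scenario'] // scenarios_per_chunk  # whole scenarios land in the same chunk

grouped = table1.groupby('chunk')   # iterate over these chunks instead of over single scenarios

Keeping whole scenarios together matters here, because the final filter takes the maximum within each scenario+letter group; if a group were split across chunks, the per-chunk maxima would still need a final reduction pass.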

As explained in "Does iterrows have performance issues?", iterating is slow, so keep the groups as large as memory allows and let pandas use its most efficient vectorised methods within each one. Pandas is relatively quick when it comes to merging.

Following on from the code above, after you create the input tables:

table3 = pd.DataFrame()

grouped = table1.groupby('scenario')

for _, group in grouped:
    # Merge one scenario at a time, so only a small slice is in memory at once
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # Keep only the row with the maximum 'calc' per letter, then add it to the output
    table3 = table3.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
    del temp
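
Note that DataFrame.append was removed in pandas 2.0. On a recent pandas version, a sketch of the same loop (not part of the original answer) collects the pieces in a list and concatenates them once at the end, which also avoids copying the growing output on every iteration:

pieces = []

for _, group in table1.groupby('scenario'):
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # keep only the row with the maximum 'calc' per letter for this scenario
    pieces.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])

table3 = pd.concat(pieces, ignore_index=True)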
I'm attempting to be memory efficient by deleting temp on each iteration. Does that actually help, or not? I want to avoid creating objects that stay in memory. Is append() the right choice? I'd love to hear any other solutions to this 'too big to merge' problem, thanks. – KieranPC Jul 30 '14 at 22:01