When trying to merge two dataframes with pandas I get this message: "ValueError: array is too big." I estimate the merged table would have about 5 billion rows, which is probably too much for my computer with 8 GB of RAM (is this limit set by my RAM, or is it built into pandas itself?).
I know that once I have the merged table I will calculate a new column and then filter the rows, keeping only the maximum value within each group. Therefore the final output table will be only 2.5 million rows.
How can I break this problem up so that I can perform the merge on smaller pieces and build up the output table, without hitting my RAM limit?
The method below works correctly on this small example data, but fails on the larger, real data:
import pandas as pd
import numpy as np
# Create input tables
t1 = {'scenario': [0, 0, 1, 1],
      'letter': ['a', 'b'] * 2,
      'number1': [10, 50, 20, 30]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [2, 5, 4, 7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)
# Merge the two, create the new column. This causes "...array is too big."
table3 = pd.merge(table1, table2, on='letter')
table3['calc'] = table3['number1'] * table3['number2']
# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario','letter'])['calc'].idxmax()]
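For context, the kind of chunked approach I have in mind is to merge one 'scenario' at a time, reduce each piece immediately, and then combine the small results. This is only a rough sketch (the merge_in_chunks helper and the per-scenario split are my own illustrative assumptions, not something I have tested on the real data):
import pandas as pd
# Sketch: merge and reduce one scenario at a time so the full
# multi-billion-row intermediate table never has to exist in memory at once.
def merge_in_chunks(table1, table2):
    pieces = []
    for scenario, chunk in table1.groupby('scenario'):
        merged = pd.merge(chunk, table2, on='letter')
        merged['calc'] = merged['number1'] * merged['number2']
        # Reduce immediately: keep only the max-'calc' row per scenario+letter
        reduced = merged.loc[merged.groupby(['scenario', 'letter'])['calc'].idxmax()]
        pieces.append(reduced)
    return pd.concat(pieces, ignore_index=True)
table3 = merge_in_chunks(table1, table2)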
This is a follow-up to two previous questions:
Does iterrows have performance issues?
What is a good way to avoid using iterrows in this example?
I answer my own question below.