I discussed the performance issues of iterrows previously, and was given good general responses. This question is a specific case where I'd like your help in applying something better, as iterrows is SLOW.
I believe this question can be useful for any new Python/pandas programmers who feel stuck in a row-iteration mentality.
The examples I've seen using 'map' or 'apply' generally work on a single DataFrame, which seems intuitive enough. However, I am working across two tables, and they are large (table1 has ~2.5 million rows, table2 has ~96,000 rows).
Here is a simple example (it works in my session):
import pandas as pd
import numpy as np

# Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Create the body of the new table (one row per row of table1)
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index, dtype=object)

# Define the optimization: return the table2 row with the largest number2 * number1
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

# Iterate through table1, filtering the relevant table2 rows, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    table3.loc[row_index] = optimize(t2info, row['number1'])

print(table3)
The output is:
  letter  number2
0      a      0.5
1      b      0.1

[2 rows x 2 columns]
General idea:
- Producing table3 is the goal; it has the same number of rows as table1.
- Populate table3 with the 'optimal' row from table2, given the corresponding inputs from table1.
- The data to use from table2 is the subset whose 'letter' matches the row in table1.
(Obviously this case is not slow because it is tiny, but when working with millions of rows it is. Bear in mind that in the real example I have more columns in both tables.)
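For concreteness, below is the kind of vectorized sketch I imagine could replace the loop: merge the two tables on 'letter', score every candidate pairing, and keep the best candidate per table1 row with a grouped idxmax. The score number1 * number2 is just what optimize computes above; the column names come from the toy example, and I'm not sure how well this approach scales to the real table sizes.

# Hypothetical merge + groupby sketch (toy data from above).
# Assumption: the 'optimal' table2 row is the one maximizing number1 * number2,
# exactly as optimize() does; the real scoring may differ.
candidates = table1.reset_index().merge(table2, on='letter')  # one row per (table1 row, matching table2 row)
candidates['score'] = candidates['number1'] * candidates['number2']
best = candidates.loc[candidates.groupby('index')['score'].idxmax()]  # best candidate per table1 row
table3_vec = best.set_index('index')[['letter', 'number2']]
print(table3_vec)

On the toy data this reproduces table3; my concern is whether the intermediate merged frame stays manageable at the real sizes, which is part of what I'm asking.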