
I discussed the performance issues of iterrows previously, and was given good general responses. This question is a specific case where I'd like your help in applying something better, as iterrows is SLOW.

I believe the question can be useful for any new python/pandas programmers who feel stuck thinking with a row iteration mentality.

The examples I've seen using 'map' or 'apply' generally show one DataFrame, which seems intuitive enough. However, I am working across two tables, and they are large (T1 is 2.5 million rows, T2 is 96,000 rows).

Here is a simple example (it works in my session):

import pandas as pd
import numpy as np

# Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Define the optimization: for a given subset of table2 and a number1 value,
# return the table2 row whose number2 * number1 is largest
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

# Create the body of the new table (one row per row of table1)
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index)

# Iterate through table1, filter the relevant rows of table2, optimize, store the result
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    table3.loc[row_index] = optimize(t2info, row['number1'])

print(table3)

The output is:

  letter number2
0      a     0.5
1      b     0.1

[2 rows x 2 columns]

General idea:

  1. Producing table 3 is the goal - it has the same dimensions as table 1
  2. Populate table 3 with the 'optimal' row from table 2, given the corresponding inputs from table 1.
  3. The data to use from table 2 is a subset based on the 'letter' from table 1

(Obviously this case is not slow because it is tiny, but when working with millions of rows it is. Bear in mind that in the real example I have more columns in both tables.)
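(Aside: the repeated boolean filter inside the loop can be computed once per letter with groupby/get_group. Below is a minimal sketch reusing the optimize function above; the row iteration itself is unchanged, so it is not a full fix.)

# Group table2 once, then look up each letter's subset instead of
# re-filtering with a boolean mask on every iteration
letter_groups = table2.groupby('letter')

rows = []
for _, row in table1.iterrows():
    t2info = letter_groups.get_group(row['letter']).reset_index(drop=True)
    rows.append(optimize(t2info, row['number1']))

table3 = pd.DataFrame(rows).reset_index(drop=True)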

  • Merging T1 and T2 to start with would give 5 billion rows so I imagine I need to avoid that (unless I underestimate the memory of my computer) – KieranPC Jul 21 '14 at 21:39
    I think you misunderstand merging -- it takes the intersection of the column you merge on. Merging your example dataframes gives a dataframe with four rows -- not 8. – exp1orer Jul 21 '14 at 21:43
  • (Depending on what you pass to the `how` parameter. `inner` is the default.) – exp1orer Jul 21 '14 at 21:51
  • Unfortunately not. I get this message: "ValueError: array is too big." I'm pretty sure that I will get 5 billion rows having looked into the data (I agree it is not creating a cartesian product). I plan on trying itertools with the groupby feature. I might make two grouped objects, one for each table, to start. Then iterate to find the "matching" groups. I will then merge and apply on each as you have done, aggregating to a new table. If you know how to do that I'd be grateful to see it on this (tiny) example. If I succeed, I'll post it myself :) – KieranPC Jul 29 '14 at 17:14
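A minimal sketch of the per-group plan described in the last comment, under the assumption that each per-letter merge fits in memory: group both tables on letter, merge and reduce each matching pair of groups, and only then concatenate the per-group results. (The helper name optimize_per_group and the intermediate 'index'/'product' columns are illustrative, not from the original post.)

import pandas as pd

def optimize_per_group(table1, table2):
    pieces = []
    t2_groups = table2.groupby('letter')
    # Iterate over table1 grouped by letter, so each table2 subset is fetched once
    for letter, t1_group in table1.groupby('letter'):
        merged = t1_group.reset_index().merge(t2_groups.get_group(letter), on='letter')
        merged['product'] = merged['number1'] * merged['number2']
        # For each original table1 row (tracked by 'index'), keep the table2 row
        # with the largest product
        best = merged.loc[merged.groupby('index')['product'].idxmax()]
        pieces.append(best.set_index('index')[['letter', 'number2']])
    return pd.concat(pieces).sort_index()

print(optimize_per_group(table1, table2))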

2 Answers


To me, looks like the easiest thing is to merge on letter and then groupby.

import pandas as pd
import numpy as np

# Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

table3 = table1.merge(table2,on='letter')

grouped = table3.groupby('letter')

def get_optimization(df):
    product_column = df.number1 * df.number2
    idx_of_prod_col_max = product_column.idxmax()
    return_val = df.loc[idx_of_prod_col_max]['number2']
    return return_val

table3 = grouped.apply(get_optimization)
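
For the example data this returns a Series indexed by letter (0.5 for a, 0.1 for b). If you want a two-column DataFrame like the table3 in the question, one way (a small sketch, assuming only letter and number2 are needed) is to reset the index and name the value column:

result = grouped.apply(get_optimization)
table3 = result.reset_index(name='number2')   # columns: letter, number2
print(table3)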

As I posted in the other answer, sometimes the problem isn't the loops themselves but the unnecessary boxing of data into DataFrames or Series.

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]` removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    return pd.DataFrame(ret, columns=('letter', 'number2'))

Ways to improve include:

  1. Use groupby and groups indexing to avoid repetitive boolean calculation.
  2. Use to_records to avoid converting records from and to Series.
  3. Don't create the DataFrame until you have compiled all its data.
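
For completeness, a quick usage check against the example tables from the question (assuming table1 and table2 are defined as above); it should print the same two rows as the original loop:

table3 = iterthrough()
print(table3)
#   letter  number2
# 0      a      0.5
# 1      b      0.1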