
I have a dataframe; the minimal code to run is below:

import numpy as np
import pandas as pd

df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)

I am currently performing some row by row calculations by doing the following:

def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length

    if row_num >= length:
        # use the result already written for the previous row
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out

    df.at[row_num, 'out'] = out


df.apply(transform, axis=1)

This yields the correct result:

   A  B    out
0  1  5    0.5
1  2  2    1.0
2  3  3   11.0
3  4  1  -39.0
4  5  5  181.0

The breakdown for the correct calculation is as follows:

   A  B    out
0  1  5    0.5
out = a / length
out = 1 / 2 = 0.5

1  2  2    1.0
out = a / length
out = 2 / 2 = 1.0

For row_num >= length:

2  3  3   11.0
out = (b - previous_out) * weight + previous_out
out = (3 - 1) * 5 + 1 = 11

3  4  1  -39.0
out = (1 - 11) * 5 + 11 = -39

4  5  5  181.0
out = (5 - (-39)) * 5 + (-39) = 181
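To make the breakdown above concrete, here is a plain-Python sketch of the same per-row rule (the variable names are mine, not from the question) that reproduces the expected column:

```python
A = [1, 2, 3, 4, 5]
B = [5, 2, 3, 1, 5]
length, weight = 2, 5

out = []
for i in range(len(A)):
    if i < length:
        # first `length` rows: out = a / length
        out.append(A[i] / length)
    else:
        # later rows feed on the previous result
        out.append((B[i] - out[i - 1]) * weight + out[i - 1])

print(out)  # [0.5, 1.0, 11.0, -39.0, 181.0]
```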

Executing this across many columns is slow because of the looping, so I would like to optimize it by taking advantage of some kind of vectorization if possible.

My current attempt looks like this:

df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()

This is not working and I'm not quite sure where to go from here.
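A small check (using the numbers from row 2 of the example) shows why no `shift()`-based expression can work here. The update rule rearranges to a first-order recurrence, so each output needs the freshly computed previous output, while `shift()` on `'out'` only sees the initial NaNs:

```python
#   (b - prev) * weight + prev  ==  weight * b + (1 - weight) * prev
# Each step consumes the previous *result*, so out[2] must already exist
# before out[3] can be computed.
weight = 5
prev, b = 1.0, 3.0  # previous_out and B from row 2

left = (b - prev) * weight + prev
right = weight * b + (1 - weight) * prev
print(left, right)  # 11.0 11.0
```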


ManreeRist

1 Answer


You won't be able to do better than this:

df['out'] = df.A / length
for i in range(length, len(df)):
    prev = df.loc[i - 1, 'out']
    df.loc[i, 'out'] = (df.loc[i, 'B'] - prev) * weight + prev

The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). A calculation where every result depends on the previous one can't be expressed as a single matrix operation - there will always be some kind of loop going on behind the scenes.
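That said, the loop itself can be made much cheaper. A sketch (not from the original answer) that keeps the loop but drops the per-cell `.loc` lookups by working on plain NumPy arrays and writing the column back once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
length, weight = 2, 5

# Pull the columns into contiguous NumPy arrays once; looping over
# these avoids pandas' per-element indexing overhead entirely.
b = df['B'].to_numpy(dtype=float)
out = (df['A'] / length).to_numpy(dtype=float)
for i in range(length, len(out)):
    out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]

df['out'] = out  # single write-back
print(df['out'].tolist())  # [0.5, 1.0, 11.0, -39.0, 181.0]
```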

Josh Friedlander
  • I understand what you're saying. Say I have five 10,000-row dfs and need to do this calculation on 5 columns - that's 250,000 executions already; add in the 4 `.loc` lookups per iteration and all of a sudden we're up to 1m total lookups. Is there maybe not a better way to write the equation to allow this to become a more efficient operation? – ManreeRist Feb 06 '19 at 11:39
  • Seems to me like a fundamental limitation of vector algebra. But I'm a rank amateur and would be happy to be proved wrong by one of the more seasoned SOers. – Josh Friedlander Feb 06 '19 at 11:43
  • Also, all the answers in the thread I linked are relevant to you. Note especially jpp's recommendation of `numba`, which compiles straight into machine code. You'll still have to loop but your execution will be much faster. – Josh Friedlander Feb 06 '19 at 11:51
  • 1
    Thanks Josh, i've just accepted as answer now as it looks like thats just the way it is. Thanks for the link will read more, funny you posted as this is one i'd been looking at before also. – ManreeRist Feb 06 '19 at 12:02
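The `numba` route mentioned in the comments can be sketched roughly like this (a hedged sketch: the kernel is plain NumPy, and `njit` falls back to a no-op decorator if numba isn't installed, so the compiled path is optional):

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # JIT-compiles the loop to machine code
except ImportError:
    def njit(f):  # fallback: run the same loop in plain Python
        return f

@njit
def recurse(a, b, length, weight):
    out = np.empty(len(a))
    for i in range(len(a)):
        if i < length:
            out[i] = a[i] / length
        else:
            out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]
    return out

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
df['out'] = recurse(df['A'].to_numpy(np.float64),
                    df['B'].to_numpy(np.float64), 2, 5)
print(df['out'].tolist())  # [0.5, 1.0, 11.0, -39.0, 181.0]
```

The loop is still there, but compiled per-element arithmetic over raw arrays is orders of magnitude cheaper than repeated pandas indexing.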