
I have a dataframe; the minimal code to run is below:

import numpy as np
import pandas as pd

df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)

I am currently performing some row by row calculations by doing the following:

def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length

    if row_num >= length:
        # use the result already written for the previous row
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out

    df.at[row_num, 'out'] = out


df.apply(transform, axis=1)

This yields the correct result:

   A  B    out
0  1  5    0.5
1  2  2    1.0
2  3  3   11.0
3  4  1  -39.0
4  5  5  181.0

The breakdown for the correct calculation is as follows:

   A  B    out
0  1  5    0.5
out = a / length
out = 1 / 2 = 0.5

1  2  2    1.0
out = a / length
out = 2 / 2 = 1.0

For row_num >= length:

2  3  3   11.0
out = (b - previous_out) * weight + previous_out
out = (3 - 1) * 5 + 1 = 11

3  4  1  -39.0
out = (1 - 11) * 5 + 11 = -39

4  5  5  181.0
out = (5 - (-39)) * 5 + (-39) = 181
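To make the breakdown above concrete, here is a plain-Python sketch of the same per-row rule (the variable names are mine, not from the question) that reproduces the expected column:

```python
A = [1, 2, 3, 4, 5]
B = [5, 2, 3, 1, 5]
length, weight = 2, 5

out = []
for i in range(len(A)):
    if i < length:
        # first `length` rows: out = a / length
        out.append(A[i] / length)
    else:
        # later rows feed on the previous result
        out.append((B[i] - out[i - 1]) * weight + out[i - 1])

print(out)  # [0.5, 1.0, 11.0, -39.0, 181.0]
```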

Executing this across many columns is slow because of the looping, so I would like to optimize it by taking advantage of some kind of vectorization if possible.

My current attempt looks like this:

df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()

This is not working and I'm not quite sure where to go from here.
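A small check (using the numbers from row 2 of the example) shows why no `shift()`-based expression can work here. The update rule rearranges to a first-order recurrence, so each output needs the freshly computed previous output, while `shift()` on `'out'` only sees the initial NaNs:

```python
#   (b - prev) * weight + prev  ==  weight * b + (1 - weight) * prev
# Each step consumes the previous *result*, so out[2] must already exist
# before out[3] can be computed.
weight = 5
prev, b = 1.0, 3.0  # previous_out and B from row 2

left = (b - prev) * weight + prev
right = weight * b + (1 - weight) * prev
print(left, right)  # 11.0 11.0
```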


ManreeRist

1 Answer


You won't be able to do better than this:

df['out'] = df.A / length
for i in range(length, len(df)):
    prev = df.loc[i - 1, 'out']
    df.loc[i, 'out'] = (df.loc[i, 'B'] - prev) * weight + prev

The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). A calculation where every result depends on the previous one can't be expressed as a single matrix operation - there will always be some kind of loop going on behind the scenes.
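That said, the loop itself can be made much cheaper. A sketch (not from the original answer) that keeps the loop but drops the per-cell `.loc` lookups by working on plain NumPy arrays and writing the column back once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
length, weight = 2, 5

# Pull the columns into contiguous NumPy arrays once; looping over
# these avoids pandas' per-element indexing overhead entirely.
b = df['B'].to_numpy(dtype=float)
out = (df['A'] / length).to_numpy(dtype=float)
for i in range(length, len(out)):
    out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]

df['out'] = out  # single write-back
print(df['out'].tolist())  # [0.5, 1.0, 11.0, -39.0, 181.0]
```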

Josh Friedlander
  • I understand what you're saying. Say I have five 10,000-row dfs and need to do this calculation on 5 columns - that's 250,000 executions already; add in the 4 `.loc` lookups per iteration and all of a sudden we're up to 1m total lookups. Is there maybe not a better way to write the equation to allow this to become a more efficient operation? – ManreeRist Feb 06 '19 at 11:39
  • Seems to me like a fundamental limitation of vector algebra. But I'm a rank amateur and would be happy to be proved wrong by one of the more seasoned SOers. – Josh Friedlander Feb 06 '19 at 11:43
  • Also, all the answers in the thread I linked are relevant to you. Note especially jpp's recommendation of `numba`, which compiles straight into machine code. You'll still have to loop but your execution will be much faster. – Josh Friedlander Feb 06 '19 at 11:51
  • 1
    Thanks Josh, i've just accepted as answer now as it looks like thats just the way it is. Thanks for the link will read more, funny you posted as this is one i'd been looking at before also. – ManreeRist Feb 06 '19 at 12:02
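The `numba` route mentioned in the comments can be sketched roughly like this (a hedged sketch: the kernel is plain NumPy, and `njit` falls back to a no-op decorator if numba isn't installed, so the compiled path is optional):

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # JIT-compiles the loop to machine code
except ImportError:
    def njit(f):  # fallback: run the same loop in plain Python
        return f

@njit
def recurse(a, b, length, weight):
    out = np.empty(len(a))
    for i in range(len(a)):
        if i < length:
            out[i] = a[i] / length
        else:
            out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]
    return out

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
df['out'] = recurse(df['A'].to_numpy(np.float64),
                    df['B'].to_numpy(np.float64), 2, 5)
print(df['out'].tolist())  # [0.5, 1.0, 11.0, -39.0, 181.0]
```

The loop is still there, but compiled per-element arithmetic over raw arrays is orders of magnitude cheaper than repeated pandas indexing.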