Pandas loop taking way too much time - better way?

Question

I have a loop that is taking way too much time, and I wonder whether there is a better way or whether I am making rookie mistakes.

I am using a loop because the first value is computed differently and every later value depends on the previous one.

# create var and set to 0
df['amt_model'] = 0

# create the cashflow variable
df['cf'] = df['cash_in'] - df['cash_out'] + df['transfer']

Now I loop through the range of months to create the 'amt_model' values.

for i in range(len(df)):

    # adjust for the first month
    if i == 0:
        df['amt_model'].iloc[i] = df['contrib'].iloc[i]

    else:

        amt1 = df['amt_model'].iloc[i - 1] * (1 + df['pct_model'].iloc[i])
        amt2 = df['cf'][i] * (1 + df['pct_model'].iloc[i] / 2)

        df['amt_model'].iloc[i] = amt1 + amt2

This takes far too long for a loop over only 20 to 50 values.

index_values - start 19:28
index_values - end 19:42

Thanks!

Please, can you at least take the time to write one question with a [mcve], including data? — cs95, Feb 08 '18 at 04:48
I'd be happy to, but I don't fully understand what you would like. Sorry, I am not a coder by profession. This code is just a small part of a far, far bigger app, way too much to post. I'm just looking for someone to proof my methodology. Are there obvious errors causing it to eat so much time? Thanks! — diogenes, Feb 08 '18 at 05:00
Check this out: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — cs95, Feb 08 '18 at 05:04

David Leon · Accepted Answer · 2018-02-08T08:55:29.417

My solution, using this random test data:

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['cf','cash_in','cash_out','transfer','contrib','pct_model'])
for c in df.columns:
    df[c] = np.random.rand(100)*100

print(df.head())

          cf    cash_in   cash_out   transfer    contrib  pct_model
0  18.478061  80.073920  19.041986   8.859406  85.695653  18.174608
1  96.172043  72.786434  54.215755  76.859253  87.934012  47.415420
2  79.026521  63.252437  29.094382  23.460806  30.547062  36.154976
3  64.630058  85.409417  98.469148  84.905463  32.859257  75.908211
4  54.121041   8.823944  48.835937   5.194054  17.004900  25.130477

Iterate over the rows to build a new list, then assign it to df:

#amt_model is your future column
amt_model = [df.loc[0,'contrib']] #init with first row

#Calling df[1:] will get all your df except first row, iterate over it
for i, row in df[1:].iterrows():
    _amt_model = amt_model[-1] * (1 + row.pct_model)
    amt_model.append( _amt_model + row.cf * (1 + row.pct_model/2))

df['amt_model'] = amt_model #assign to your df

print(df.amt_model.head())

0    8.569565e+01
1    6.525182e+03
2    2.439506e+05
3    1.876432e+07
4    4.903214e+08
Name: amt_model, dtype: float64

Performance: 100 loops, best of 3: 13.7 ms per loop

Is that what you expected?

Alternatives

If so, you can try it as a one-liner:

Opt1:

amt_model = [df.loc[0,'contrib']]
[amt_model.append( amt_model[-1] * (1 + row.pct_model) + row.cf * (1 + row.pct_model/2) ) 
for (i,row) in df[1:].iterrows()]

df['amt_model'] = amt_model

#Performance:
100 loops, best of 3: 14.7 ms per loop

Opt2 - With apply:

amt_model = [df.loc[0,'contrib']]
df[1:].apply(lambda row: amt_model.append( amt_model[-1] * (1 + row.pct_model) + row.cf * (1 + row.pct_model/2) ),
             axis='columns')

df['amt_model'] = amt_model

#Performance:
100 loops, best of 3: 11.7 ms per loop
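
A further variant, not part of the timings above: the same recurrence can be run on plain NumPy arrays, which avoids building a Series object for every row the way iterrows does. This is only a sketch. The helper name amt_model_from_arrays and the small demo frame are made up for illustration, and it assumes the frame has at least one row and the columns contrib, cf and pct_model.

```python
import numpy as np
import pandas as pd


def amt_model_from_arrays(df):
    """Same recurrence as above, run on NumPy arrays (hypothetical helper)."""
    pct = df["pct_model"].to_numpy(dtype=float)
    cf = df["cf"].to_numpy(dtype=float)

    out = np.empty(len(df), dtype=float)
    out[0] = df["contrib"].iloc[0]  # first month is just the contribution
    for i in range(1, len(df)):
        # previous amount grown by this month's rate, plus the adjusted cashflow
        out[i] = out[i - 1] * (1 + pct[i]) + cf[i] * (1 + pct[i] / 2)
    return out


# Small made-up frame, only to show the call
demo = pd.DataFrame({
    "contrib": [100.0, 0.0, 0.0],
    "cf": [0.0, 10.0, -5.0],
    "pct_model": [0.0, 0.02, -0.01],
})
demo["amt_model"] = amt_model_from_arrays(demo)
```

As in the list-based version, the column is written to the frame once at the end rather than cell by cell inside the loop.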

When you use shift on the amt_model column you get only '0'. It needs to be updated in each iteration. — CezarySzulc, Feb 08 '18 at 07:57
interesting - thank you very much. will test them both and let you know — diogenes, Feb 09 '18 at 08:48
ok, you are welcome. Tell me if some unexpected behaviour happens — David Leon, Feb 09 '18 at 11:54
WOW! using the first solution of creating a list and then putting that list into the df at the end - just changed life as I know it. I went from almost 16 seconds to 1 second. Now I need to rewrite a lot more code based on this approach. THANK YOU! — diogenes, Feb 12 '18 at 06:31

score 0 · Answer 2 · answered Feb 08 '18 at 07:55

You can improve it by pulling 'amt2' out of the loop. I would use something like this:

df['amt2'] = df['cf'] * (1 + df['pct_model'] / 2)
df['amt1_1'] = 1 + df['pct_model']

# position of the target column, so each write is a single .iloc call
col = df.columns.get_loc('amt_model')

for i in range(len(df)):
    # adjust for the first month
    if i == 0:
        df.iloc[i, col] = df['contrib'].iloc[i]
    else:
        # later months: previous amount times growth, plus the precomputed amt2
        amt1 = df['amt_model'].iloc[i - 1] * df['amt1_1'].iloc[i]
        df.iloc[i, col] = amt1 + df['amt2'].iloc[i]

You need to update 'amt_model' in each iteration, so I don't see any other option.

score 0 · Answer 3 · answered Feb 08 '18 at 08:26

Did you try this?

df.loc[0,'amt_model' ] = df.loc[0,'contrib']
amt1 = (df.loc[:(len(df)-2),'amt_model']) * (1 + df.loc[1:, 'pct_model'].reset_index(drop=True))
amt2 = (df[ 'cf' ]) * (1 + df[ 'pct_model' ]/2)
df['amt_model'] = amt1 + amt2

Slicing up to len(df)-2 gives you the t-1 values, and df.loc[1:] gives you the t values, so both have the same length.

J. Doe


I was hoping to find a way without doing a loop. Can't wait to try. Thanks! – diogenes Feb 09 '18 at 08:48
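
On the wish for a version without a loop: because each month's amount is the previous amount times a growth factor plus a term that does not depend on earlier amounts, the recurrence can in principle be unrolled with cumprod and cumsum. The sketch below is not taken from any of the answers. The helper name amt_model_closed_form and the demo data are hypothetical, and it assumes that 1 + pct_model is never zero and that the cumulative growth stays within floating-point range. With percentages as large as in the random test data above, the explicit loop is likely the safer choice.

```python
import numpy as np
import pandas as pd


def amt_model_closed_form(df):
    """Loop-free evaluation of the recurrence (hypothetical helper).

    amt[0] = contrib[0]
    amt[i] = amt[i-1] * (1 + pct[i]) + cf[i] * (1 + pct[i] / 2)
    """
    pct = df["pct_model"].to_numpy(dtype=float)
    cf = df["cf"].to_numpy(dtype=float)

    # Cumulative growth from month 1 onward; month 0 has no growth applied
    growth = 1.0 + pct
    growth[0] = 1.0
    cum_growth = np.cumprod(growth)

    # Each cashflow term divided by the growth accumulated so far;
    # slot 0 carries the starting contribution instead
    scaled = cf * (1.0 + pct / 2.0) / cum_growth
    scaled[0] = df["contrib"].iloc[0]

    # Dividing the recurrence by cum_growth turns it into a running sum
    return cum_growth * np.cumsum(scaled)


# Small made-up frame, only to show the call
demo = pd.DataFrame({
    "contrib": [100.0, 0.0, 0.0],
    "cf": [0.0, 10.0, -5.0],
    "pct_model": [0.0, 0.02, -0.01],
})
demo_result = amt_model_closed_form(demo)
```

It is worth comparing the result against the loop on real data with np.allclose before relying on it, since rounding behaves differently from the step-by-step calculation.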
