3

I'm trying to create a new Pandas DataFrame column using shifted values of the column being created itself.

The only way I've been able to do so is by iterating through the data which is too slow and causing a bottleneck in my code.

import pandas as pd 

df = pd.DataFrame([1,6,2,8], columns=['a'])
df.at[0, 'b'] = 5

for i in range(1, len(df)):
    df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2

I tried using shift but it didn't work. It fills in the value for row 1 and NaN for the rest. I'm assuming this method can't read newly created values on the fly.

df.loc[1:, ('b')] = (df.a.shift() + df.b.shift()) /2

UPDATE

I was able to significantly reduce the timing by using df.at in the iteration rather than df.loc

def with_df_loc(df):
    for i in range(1, len(df)):
        df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2
    return df

def with_df_at(df):
    for i in range(1, len(df)):
        df.at[i, 'b'] = (df.a[i-1] + df.b[i-1]) /2
    return df



%timeit with_df_loc(df)
183 ms ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit with_df_at(df)
19.4 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This timing is based on a larger dataset of 150 rows. Considering that df.rolling(20).mean() takes about 3ms, I think this might be the best I can do.

Thanks for the answers, if I need to further optimize I'll look into Asish M's suggestion of numba.

JDavda
  • 33
  • 4
  • pandas does not really have easy methods to deal with recursive stuff - look into `numba` to speed up that for loop. – Asish M. Dec 26 '20 at 09:55
  • as an alternative - [scipy.signal.filter](http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html) seems useful here - see https://stackoverflow.com/questions/26267809/recursive-definitions-in-pandas – Asish M. Dec 26 '20 at 10:06
  • Are you still looking for a more performant solution to the loop above? I'm a bit confused because the questions and answers don't match. – cs95 Dec 26 '20 at 10:50
  • @cs95 I edited the question a bit since one of the answers worked but only because the df['a'] column values were in numerical order where it should work if the column contained random values. I was able to speed it up by using ```df.at``` instead of ```df.loc``` in the iteration, but if you know of another way, I'm eager to hear it. Always looking for ways to improve code. – JDavda Dec 26 '20 at 11:26

3 Answers3

3

We can use numba to speed up calculations here, see Enhancing performance section in the docs.

import numba 

@numba.njit
def func(a, b_0=5):
    n = len(a)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = (b[i - 1] + a[i - 1]) / 2
    return b

df['b'] = func(df['a'].to_numpy())
df

   a     b
0  1  5.00
1  6  3.00
2  2  4.50
3  8  3.25

Comparing performance

Benchmarking code, for reference.

enter image description here

The blue line represents the performance of the fastest version of your current method (using .at). The orange line represents the numba's performance.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • Thank you @cs95 - I was able to use your code and reduce the time down 9ms (using a different larger df) compared to 19.4ms using ```df.at``` which is amazing. I'm really happy about this. quick question, in your [benchmark code](https://gist.github.com/Coldsp33d/cc71e95f7edda25ee798c99195d245ab#file-shifted_calculations_benchmark-py) I didn't see the ```func``` function or a ```numba``` import - what am I missing? – JDavda Dec 26 '20 at 12:41
  • 2
    @cs95 great answer sir, keep up the good work :) – Shubham Sharma Dec 26 '20 at 13:03
  • 1
    @cs95 I've marked it answer accepted - thanks again, working with ```numba``` has been on a my list for a while, appreciate you showing the way! – JDavda Dec 26 '20 at 13:13
1

You could try shift + cumsum, starting from 5 with fillna:

import pandas as pd


df = pd.DataFrame([1,2,3,4], columns=['a'])
df['b'] = df['a'].shift().fillna(5).cumsum()

print(df)

Output

   a     b
0  1   5.0
1  2   6.0
2  3   8.0
3  4  11.0
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • Thanks, I may be able to use fillna() rather than my current method of df.at but the calculation needs to be based on the previous value of a+b. In this case your code works because the 'a' value are 1,2,3,4 but I simplified my actual code for this example - it's random values for 'a' in my original code. I'll edit the question to clear that up – JDavda Dec 26 '20 at 09:08
0

I most likely would have misconstrued your question, but try this if you're looking to create a shifted column:

df = pd.DataFrame([1,2,3,4], columns=['a'])
df["b"] = df.a.shift()
Tony Ng
  • 164
  • 2
  • 12