I'm trying to create a new Pandas DataFrame column using shifted values of the column being created itself.
The only way I've been able to do so is by iterating through the data which is too slow and causing a bottleneck in my code.
import pandas as pd
df = pd.DataFrame([1,6,2,8], columns=['a'])
df.at[0, 'b'] = 5
for i in range(1, len(df)):
df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2
I tried using shift but it didn't work. It fills in the value for row 1 and NaN for the rest. I'm assuming this method can't read newly created values on the fly.
df.loc[1:, ('b')] = (df.a.shift() + df.b.shift()) /2
UPDATE
I was able to significantly reduce the timing by using df.at
in the iteration rather than df.loc
def with_df_loc(df):
for i in range(1, len(df)):
df.loc[i, ('b')] = (df.a[i-1] + df.b[i-1]) /2
return df
def with_df_at(df):
for i in range(1, len(df)):
df.at[i, 'b'] = (df.a[i-1] + df.b[i-1]) /2
return df
%timeit with_df_loc(df)
183 ms ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_df_at(df)
19.4 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This timing is based on a larger dataset of 150 rows. Considering that df.rolling(20).mean()
takes about 3ms, I think this might be the best I can do.
Thanks for the answers, if I need to further optimize I'll look into Asish M's suggestion of numba
.