You receive this error because df_win is actually a pd.Series. rolling(...).apply handles each column separately and passes its window to the function as a pd.Series, not as a pd.DataFrame, so the correlation cannot be computed: there is only a single series of data. There is more of an explanation in this answer.
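To see this for yourself, here is a minimal sketch that records the type of the object handed to the function. The names demo, seen_types and record_type are made up for illustration and are not from your code; raw=False is passed explicitly because its default has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, standing in for the df in the question.
demo = pd.DataFrame({"A": np.arange(5.0), "B": np.arange(5.0) ** 2})
seen_types = []

def record_type(win):
    # Record what rolling.apply actually passes in for each window.
    seen_types.append(type(win).__name__)
    return 0.0  # rolling.apply expects a single numeric value back

# raw=False asks pandas to pass each window as a Series (with its index).
demo.rolling(window=3).apply(record_type, raw=False)
print(set(seen_types))
```

Every window arrives as a one-column Series, once for column A and once for column B, which is why calling .corr() on it cannot produce a two-column correlation matrix.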
Here are a few options for computing the correlation in your question:
Option 1 - rolling apply
Using almost the same code as yours, I have adapted the function to index the global variable df at the start of the return statement, so that both columns are used rather than the single column in df_win (which is a pd.Series). I have also changed the line calling the function to run on a single column only; otherwise the same output is returned twice, once per column.
def _corr_single_window_(df_win):
    # df_win is a single-column pd.Series; use its index to select both columns from the global df
    return df.loc[df_win.index].mul(w[-df_win.shape[0]:], axis=0).corr().iloc[0, 1]
df.rolling(window=window, min_periods=min_obs)["A"].apply(_corr_single_window_)
# 18.2 s ± 437 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Option 2 - list comprehension
This list comprehension has an if/else expression for min_obs, and uses max(0, i-window) as the lower bound in .iloc so that the slice grows from min_obs rows up to the full window length. If you are using Python >= 3.8, you can use a "walrus operator" inside the comprehension to avoid repeating this calculation.
pd.Series([df.iloc[max(0, i-window): i]
             .mul(w[-(i-max(0, i-window)):], axis=0)
             .corr().iloc[0, 1]
           if i>=min_obs
           else np.nan
           for i in range(1, len(df) + 1)])
# 12.8 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# (or with walrus operators for Python >= 3.8)
pd.Series([df.iloc[(min_idx := max(0, i-window)): i]
             .mul(w[-(i-min_idx):], axis=0)
             .corr().iloc[0, 1]
           if i>=min_obs
           else np.nan
           for i in range(1, len(df)+1)])
# 12.8 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
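As a sanity check, options 1 and 2 can be compared on a small frame. This is only a sketch: the data, the weights w, and the values of window and min_obs below are hypothetical stand-ins, since I do not have the ones from your question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's df, w, window and min_obs.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(12, 2)), columns=["A", "B"])
window, min_obs = 4, 2
w = np.linspace(0.25, 1.0, window)  # one weight per row of a full window

def _corr_single_window_(df_win):
    # Look up both columns of the global df via the window's index.
    return df.loc[df_win.index].mul(w[-df_win.shape[0]:], axis=0).corr().iloc[0, 1]

# Option 1: rolling apply on a single column.
opt1 = df.rolling(window=window, min_periods=min_obs)["A"].apply(
    _corr_single_window_, raw=False)

# Option 2: list comprehension over positional slices.
opt2 = pd.Series([df.iloc[max(0, i - window): i]
                    .mul(w[-(i - max(0, i - window)):], axis=0)
                    .corr().iloc[0, 1]
                  if i >= min_obs
                  else np.nan
                  for i in range(1, len(df) + 1)])

# Compare values only; NaNs in matching positions count as equal.
same = np.allclose(opt1.to_numpy(), opt2.to_numpy(), equal_nan=True)
print(same)
```

Note that option 2 returns a Series with a default integer index, so pass index=df.index to pd.Series if you need the result aligned with df.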
Option 3 - with rolling pipe
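Mostly adapted from this answer, this uses .pipe to chain functions together. Note that, as written, it only returns a value once a full window is available (i >= window) rather than from min_obs rows onwards.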
def rolling_pipe(dataframe, window, fctn):
    return pd.Series([dataframe.iloc[max(0, i-window): i].pipe(fctn)
                      if i >= window else None
                      for i in range(1, len(dataframe)+1)],
                     index=dataframe.index)
df.pipe(rolling_pipe, window,
        lambda x: x.mul(w[-x.shape[0]:], axis=0).corr().iloc[0, 1])
# 7.66 s ± 1.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
As can be seen in the timings above, the third option is the fastest. However, given the large number of observations, only 7 runs were made for each option, and repeated testing gave quite varied timings (note the ± 1.9 s spread for option 3), so treat the ranking as indicative only.
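If you want to re-run the comparison on your own data, a rough sketch with the standard library's timeit is below. The frame size, weights and parameters are hypothetical and much smaller than in your question, and option_2 is just a wrapper name I made up; swap in whichever option you want to measure.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small inputs so the sketch runs quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["A", "B"])
window, min_obs = 20, 5
w = np.linspace(0.1, 1.0, window)

def option_2():
    # The list comprehension from option 2, wrapped so it can be timed.
    return pd.Series([df.iloc[max(0, i - window): i]
                        .mul(w[-(i - max(0, i - window)):], axis=0)
                        .corr().iloc[0, 1]
                      if i >= min_obs
                      else np.nan
                      for i in range(1, len(df) + 1)])

# Five independent runs of one call each; look at the spread, not one number.
timings = timeit.repeat(option_2, number=1, repeat=5)
print(f"best {min(timings):.3f} s, worst {max(timings):.3f} s")
```

Reporting the best and worst run side by side makes it easier to tell whether a difference between two options is larger than the run-to-run noise.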