0

This question is linked to my other question here.

I want to calculate rolling exponentially-weighted correlation between two columns in a pandas Dataframe.

import pandas as pd
import numpy as np

n_obs = 10000
window = 260
min_obs = 130
halflife = 130

# create a dummy dataframe
df = pd.DataFrame(np.random.rand(n_obs, 2), columns=['A', 'B'])

# initialize weights
w = 0.5**((window-np.arange(window)-1)/(halflife-1))

# correlation for a single sliding window. 
# note because of `min_obs`, df_win might be of various lengths
def _corr_single_window_(df_win):
    return df_win.mul(w[-df_win.shape[0]:], axis=0).corr().iloc[0, 1]

# test on the dummy dataframe
df.rolling(window=window, min_periods=min_obs).apply(_corr_single_window_)

And I got this error:

DataError: No numeric types to aggregate

data-monkey
  • 1,535
  • 3
  • 15
  • 24
  • 1
    What are the contents of `df`? – sechstein Jul 06 '23 at 16:06
  • The obvious question: Does your dataframe contain numeric data? See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a sample of your input so that we can better understand your question – G. Anderson Jul 06 '23 at 16:40
  • I intended to show the creation of the dummy dataframe but forgot to write `df = ...`. Edited the question now. – data-monkey Jul 07 '23 at 08:12

1 Answers1

0

The reason you receive this error is because df_win is actually a pd.Series. The .rolling.apply does not pass a pd.DataFrame, so the correlation calculation is not possible - as there is only a single series of data. There is more of an explanation in this answer.

Here are a few options for computing the correlation in your question:

Option 1 - rolling apply

Using almost the same code as yours, I have adapted this to instead use the global variable df at the start of the return code, so that both columns are used, rather than the single column in df_win (as this is a pd.Series. Also, I have changed the line calling the function to only run for a single column, else the output is returned twice.

def _corr_single_window_(df_win):
    return df.loc[df_win.index].mul(w[-df_win.shape[0]:], axis=0).corr().iloc[0, 1]

df.rolling(window=window, min_periods=min_obs)["A"].apply(_corr_single_window_)

# 18.2 s ± 437 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Option 2 - list comprehension

This list comprehension computes has an if else statement for the min_obs, and uses max(0, i-window) as the lower bound in .iloc so that this increases from the start of min_obs to the full window length. If you are using Python>=3.8 then you can use a "walrus operator" within to not repeat this calculation.

pd.Series([df.iloc[max(0, i-window): i]
           .mul(w[-(i-max(0, i-window)):], axis=0)
           .corr().iloc[0, 1]
           if i>=min_obs
           else np.nan
           for i in range(1, len(df) + 1)])
# 12.8 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# (or with walrus operators for Python >= 3.8)
pd.Series([df.iloc[(min_idx := max(0, i-window)): i]
           .mul(w[-(i-min_idx):], axis=0)
           .corr().iloc[0, 1]
           if i>=min_obs
           else np.nan
           for i in range(1, len(df)+1)])


# 12.8 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Option 3 - with rolling pipe

Mostly adapted from this answer, this uses .pipe to chain functions together.

def rolling_pipe(dataframe, window, fctn):
    return pd.Series([dataframe.iloc[max(0, i-window): i].pipe(fctn) 
                      if i >= window else None 
                      for i in range(1, len(dataframe)+1)],
                     index = dataframe.index) 

df.pipe(rolling_pipe, window,
        lambda x: x.mul(w[-x.shape[0]:], axis=0).corr().iloc[0, 1])

# 7.66 s ± 1.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

As can be seen in the timings above, the third option is most efficient. However, given the large number of observations only 7 runs were made and testing multiple times gave quite varied timings.

Rawson
  • 2,637
  • 1
  • 5
  • 14
  • I understand all three options, but to me Option 1 is most readable. I now get the feeling that `.rolling().apply()` is too generalized and too strict at the same time. There must be a more elegant AND efficient way of doing rolling-on-dataframe-and-output-single-value-for-each-window! – data-monkey Jul 09 '23 at 11:39