How can I obtain the same result as pandas.autocorr() using NumPy?

Question

I need to replace all pandas functions with NumPy equivalents, but the pandas documentation does not explain clearly how pd.autocorr() is implemented.

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'A': np.random.random(20)})
x = df.rolling(5).apply(lambda x: x.autocorr(), raw=False).dropna()
y = []
for i in range(15):
  y.append( np.corrcoef(df['A'][i:i+5],df['A'][i+1:i+6])[0,1] )
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),df['A'][(1+i):(6+i)]-df['A'][(1+i):(6+i)].mean(),'valid')[0]
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),np.flip(df['A'][(1+i):(6+i)])-df['A'][(1+i):(6+i)].mean(),'valid')[0]

The pd.autocorr() result is quite different from that of np.corrcoef() (I tried np.correlate() as well). Is there any way to get the same result as pd.autocorr() using only NumPy functions?

----------------- Example result added ----------------

df['A'] = [0.5314742325906894, 0.7424912257400176, 0.2895649008872213, 0.16967710120380175, 0.5157732179121193, 0.8733423106397956, 0.585705172096987, 0.1387299202733231, 0.18540514459343538, 0.13913104211564564, 0.736937228263526, 0.20944078980434988, 0.2826810751427198, 0.15055686873748197, 0.4159491505728884, 0.07600226975854041, 0.15279939462562298, 0.1405723553409276, 0.8372449734938123, 0.3314986851097367]

x = [0.010637545587524432, 0.03594106077726333, 0.40104877005219836, -0.009106549297130558, 0.4008385963492408, 0.7794761931857483, -0.4182779136016351, -0.2962696925038811, -0.4083361773384266, -0.5244693987698964, -0.5063605533618415, -0.9496936641021706, -0.5303040575891907, -0.42881675192105184, -0.3371366910961831, -0.036231529863559424]

y = [0.11823200733266746, 0.16166841984627847, 0.2033980627120384, 0.2861039403548347, 0.5239653859040245, 0.1602079943122044, -0.3920837265006942, -0.28176746883177917, -0.3604612671108854, -0.5347077109231272, -0.4702461092101919, -0.5287673078857449, -0.4501452367448014, -0.3538574959825232, -0.10013342594129321]

This is a great question. Do you have some test data to illustrate the difference? If so please add an example to your post. — Dima Chubarov, Jul 12 '23 at 08:44

HMH1013 · Accepted Answer · 2023-07-12T14:28:38.453

According to the documentation for pandas.Series.autocorr, the default lag is 1 when you call the function without arguments, which means the correlation is computed between the series and a copy of itself shifted by one element.

For example:

a = np.array([0.25, 0.5, 0.2, -0.05])
s = pd.Series(a)
s.autocorr()

gives you:

0.1035526330902407

With np.corrcoef you need to slice the array into two arrays offset by one element:

np.corrcoef(a[:-1], a[1:])[0, 1]

This gives you the same result:

0.1035526330902407
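The slicing idea above can be wrapped in a small function. This is only a minimal sketch: the name `autocorr_np` is hypothetical, and it assumes the input contains no NaN values (pandas handles missing values itself, which this sketch does not try to reproduce).

```python
import numpy as np

def autocorr_np(values, lag=1):
    """Hypothetical helper: lag-`lag` autocorrelation using NumPy only."""
    values = np.asarray(values, dtype=float)
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    # Pair each element with the one `lag` positions later, then read the
    # off-diagonal entry of the resulting 2x2 correlation matrix.
    return np.corrcoef(values[:-lag], values[lag:])[0, 1]

a = np.array([0.25, 0.5, 0.2, -0.05])
result = autocorr_np(a)
print(result)
```

For the array `a` used above, this should print a value close to the 0.1035526330902407 shown for both approaches.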

So in your case the code should look like this:

W = 5 # Window size
nrows = len(df) - W + 1 # number of elements after rolling
lag = 1
y = []
for i in range(nrows):
    y.append(np.corrcoef(df['A'][i:i+W-lag],df['A'][i+lag:i+W])[0,1])

You will get the same result as x.
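To see why the original loop in the question differs, note that it pairs `df['A'][i:i+5]` with `df['A'][i+1:i+6]`, which reaches one element past the 5-value window and yields only 15 results. The sketch below keeps both slices inside the window instead. It uses a plain NumPy array holding the sample values from the question, and `rolling_autocorr_loop` is a hypothetical helper name.

```python
import numpy as np

# Sample values copied from the question.
values = np.array([
    0.5314742325906894, 0.7424912257400176, 0.2895649008872213,
    0.16967710120380175, 0.5157732179121193, 0.8733423106397956,
    0.585705172096987, 0.1387299202733231, 0.18540514459343538,
    0.13913104211564564, 0.736937228263526, 0.20944078980434988,
    0.2826810751427198, 0.15055686873748197, 0.4159491505728884,
    0.07600226975854041, 0.15279939462562298, 0.1405723553409276,
    0.8372449734938123, 0.3314986851097367,
])

def rolling_autocorr_loop(values, window=5, lag=1):
    """Hypothetical helper: lag autocorrelation inside each full window."""
    out = []
    for i in range(len(values) - window + 1):
        w = values[i:i + window]  # the values one rolling window contains
        # Both slices stay inside the window: w[:-lag] vs. w[lag:].
        out.append(np.corrcoef(w[:-lag], w[lag:])[0, 1])
    return np.array(out)

y = rolling_autocorr_loop(values)
print(len(y))        # 16 windows for 20 values and window=5
print(y[0], y[-1])
```

The first and last entries should be close to 0.0106375 and -0.0362315, the first and last values of x listed in the question.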

score 0 · Answer 2 · answered Jul 12 '23 at 11:38

pd.Series.autocorr() computes the autocorrelation with a default of lag=1. This means that it computes the correlation between an array and the same array shifted by 1.

Here are two ways to compute it:

1. Looping through and appending for each `np.corrcoef`

You almost had the correct answer, but you just needed to offset by the lag=1 parameter:

y = []
for i in range(16):
    y.append(np.corrcoef(df.A[i:i+5][:-1], df.A[i:i+5][1:])[0, 1])

# 3.66 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The solution is a mix of your own working and this answer - which might give a bit more explanation for you.

2. Without loop, using `np.lib.stride_tricks.sliding_window_view`

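np.lib.stride_tricks.sliding_window_view creates an array with one row per sliding window. You can slice each row by the lag parameter to get the leading and lagged parts, then pass both to np.corrcoef. Because np.corrcoef treats every row of both inputs as a variable, it returns a 2n x 2n matrix for n windows; taking rows n onward and then the diagonal picks out the correlation of each window with its own lagged part: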

window = 5
lag = 1

y = np.diagonal(
    np.corrcoef(
        np.lib.stride_tricks.sliding_window_view(df.A, window)[:, :-lag],
        np.lib.stride_tricks.sliding_window_view(df.A, window)[:, lag:])
    [len(df.A) - window + 1:])

# 174 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As can be seen from the timings, the second option is roughly 20 times faster (174 µs versus 3.66 ms), although the code and concepts are perhaps a little more abstract.
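One design note on the second option: np.corrcoef builds the full 2n x 2n matrix only to read n of its entries. A possible alternative, shown here as a sketch rather than as part of the answer above, computes the Pearson correlation row by row on the sliding windows. The function name `rolling_autocorr` is hypothetical, the input is assumed to contain no NaN values and no constant windows, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_autocorr(values, window=5, lag=1):
    """Hypothetical helper: row-wise Pearson correlation on sliding windows."""
    values = np.asarray(values, dtype=float)
    windows = sliding_window_view(values, window)  # shape (n - window + 1, window)
    left = windows[:, :-lag]                       # leading part of each window
    right = windows[:, lag:]                       # lagged part of each window
    # Center each row, then apply the Pearson formula along axis 1.
    left_c = left - left.mean(axis=1, keepdims=True)
    right_c = right - right.mean(axis=1, keepdims=True)
    cov = (left_c * right_c).sum(axis=1)
    norm = np.sqrt((left_c ** 2).sum(axis=1) * (right_c ** 2).sum(axis=1))
    return cov / norm

rng = np.random.default_rng(0)
data = rng.random(20)
fast = rolling_autocorr(data)
# Reference: the explicit loop with np.corrcoef on each window.
slow = np.array([np.corrcoef(data[i:i + 4], data[i + 1:i + 5])[0, 1]
                 for i in range(16)])
print(np.allclose(fast, slow))
```

The final line compares the vectorized result with the explicit loop; the two should agree up to floating-point rounding.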
