
I need to replace all pandas functions with NumPy functions, but the pandas documentation does not explain well how pd.Series.autocorr() is implemented.

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'A': np.random.random(20)})
x = df.rolling(5).apply(lambda x: x.autocorr(), raw=False).dropna()  # raw=False passes each window as a Series, which has .autocorr()
y = []
for i in range(15):
  y.append( np.corrcoef(df['A'][i:i+5],df['A'][i+1:i+6])[0,1] )
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),df['A'][(1+i):(6+i)]-df['A'][(1+i):(6+i)].mean(),'valid')[0]
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),np.flip(df['A'][(1+i):(6+i)])-df['A'][(1+i):(6+i)].mean(),'valid')[0]

The autocorr() result is quite different from that of np.corrcoef() (I tried np.correlate() as well). Is there any way to use only NumPy functions to achieve the same result as autocorr()?

----------------- Example result added ----------------

df['A'] = [0.5314742325906894, 0.7424912257400176, 0.2895649008872213, 0.16967710120380175, 0.5157732179121193, 0.8733423106397956, 0.585705172096987, 0.1387299202733231, 0.18540514459343538, 0.13913104211564564, 0.736937228263526, 0.20944078980434988, 0.2826810751427198, 0.15055686873748197, 0.4159491505728884, 0.07600226975854041, 0.15279939462562298, 0.1405723553409276, 0.8372449734938123, 0.3314986851097367]

x = [0.010637545587524432, 0.03594106077726333, 0.40104877005219836, -0.009106549297130558, 0.4008385963492408, 0.7794761931857483, -0.4182779136016351, -0.2962696925038811, -0.4083361773384266, -0.5244693987698964, -0.5063605533618415, -0.9496936641021706, -0.5303040575891907, -0.42881675192105184, -0.3371366910961831, -0.036231529863559424]

y = [0.11823200733266746, 0.16166841984627847, 0.2033980627120384, 0.2861039403548347, 0.5239653859040245, 0.1602079943122044, -0.3920837265006942, -0.28176746883177917, -0.3604612671108854, -0.5347077109231272, -0.4702461092101919, -0.5287673078857449, -0.4501452367448014, -0.3538574959825232, -0.10013342594129321]
jared
  • This is a great question. Do you have some test data to illustrate the difference? If so please add an example to your post. – Dima Chubarov Jul 12 '23 at 08:44

2 Answers


According to the documentation for pandas.Series.autocorr, the default lag is 1 when you call the function without arguments. This means the series is correlated with a copy of itself shifted by one element.

For example:

a = np.array([0.25, 0.5, 0.2, -0.05])
s = pd.Series(a)
s.autocorr()

gives you:

0.1035526330902407

With np.corrcoef you need to slice the array into two arrays offset by one element:

np.corrcoef(a[:-1], a[1:])[0, 1]

Which gives you the same result:

0.1035526330902407
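To make the equivalence explicit, here is a minimal sketch of the same lag-based Pearson correlation written out with basic NumPy operations. The name `autocorr_np` is made up for this illustration, and the sketch assumes the input contains no NaN values (pandas drops NaN pairs, which this does not attempt to replicate).

```python
import numpy as np

def autocorr_np(a, lag=1):
    """Pearson correlation between a[:-lag] and a[lag:] (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    head = a[:len(a) - lag]   # all elements except the last `lag`
    tail = a[lag:]            # all elements except the first `lag`
    head_c = head - head.mean()
    tail_c = tail - tail.mean()
    # Covariance divided by the product of the standard deviations;
    # the 1/(n-1) factors cancel, so plain sums are enough.
    return float(head_c @ tail_c / np.sqrt((head_c @ head_c) * (tail_c @ tail_c)))

a = np.array([0.25, 0.5, 0.2, -0.05])
r = autocorr_np(a)
print(r)
```

For this array the value should agree with both `s.autocorr()` and the `np.corrcoef` slice shown here, up to floating-point rounding.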

So in your case the code should look like this:

W = 5 # Window size
nrows = len(df) - W + 1 # number of elements after rolling
lag=1
y = []
for i in range(nrows):
    y.append(np.corrcoef(df['A'][i:i+W-lag],df['A'][i+lag:i+W])[0,1])

You will get the same result as x.
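As a quick check, the loop can be wrapped in a NumPy-only function and run on the example data from the question. This is only a sketch: `rolling_autocorr_np` is a name invented here, and it assumes the column has already been converted to a plain list or array with no NaN values.

```python
import numpy as np

def rolling_autocorr_np(values, window=5, lag=1):
    """Lag autocorrelation of each full rolling window (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n_windows = len(values) - window + 1
    out = np.empty(n_windows)
    for i in range(n_windows):
        w = values[i:i + window]
        # Correlate the window without its last `lag` elements
        # against the window without its first `lag` elements.
        out[i] = np.corrcoef(w[:window - lag], w[lag:])[0, 1]
    return out

# Example data copied from the question.
A = [0.5314742325906894, 0.7424912257400176, 0.2895649008872213,
     0.16967710120380175, 0.5157732179121193, 0.8733423106397956,
     0.585705172096987, 0.1387299202733231, 0.18540514459343538,
     0.13913104211564564, 0.736937228263526, 0.20944078980434988,
     0.2826810751427198, 0.15055686873748197, 0.4159491505728884,
     0.07600226975854041, 0.15279939462562298, 0.1405723553409276,
     0.8372449734938123, 0.3314986851097367]

y = rolling_autocorr_np(A)
print(len(y))
print(y[:3])
```

With 20 values and a window of 5 this produces 16 numbers, one per full window, which can be compared against the `x` list posted in the question.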

HMH1013

pd.Series.autocorr() computes the autocorrelation with a default of lag=1. This means that it computes the correlation between an array and a copy of that array shifted by 1.

Here are a couple of ways to compute it:

1. Looping through and appending for each np.corrcoef

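You almost had the correct answer. Your loop pairs df['A'][i:i+5] with df['A'][i+1:i+6], which reaches one element past the window and only runs for 15 iterations. The rolling autocorr only sees the 5 elements inside each window, so with lag=1 it correlates the first 4 elements of the window with the last 4, and there are 16 full windows: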

y = []
for i in range(16):
    y.append(np.corrcoef(df.A[i:i+5][:-1], df.A[i:i+5][1:])[0, 1])

# 3.66 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The solution is a mix of your own attempt and this answer, which might give you a bit more explanation.

2. Without loop, using np.lib.stride_tricks.sliding_window_view

np.lib.stride_tricks.sliding_window_view allows you to create an array of the sliding window. You can then slice this by the lag parameter and calculate the np.corrcoef, then take the diagonal output of the second half of the output array:

window = 5
lag = 1

y = np.diagonal(
    np.corrcoef(
        np.lib.stride_tricks.sliding_window_view(df.A, window)[:, :-lag],
        np.lib.stride_tricks.sliding_window_view(df.A, window)[:, lag:])
    [len(df.A) - window + 1:])

# 174 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
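The indexing step may be easier to follow on a small case. The sketch below uses made-up random arrays `A` and `B` with 3 rows each (they stand in for the two sliced window arrays) and shows that np.corrcoef stacks the rows of both inputs into one square matrix, so the row-by-row correlations sit on the diagonal of the lower-left block.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                      # number of rows (windows) in this toy example
A = rng.random((n, 4))     # stands in for the windows without their last element
B = rng.random((n, 4))     # stands in for the windows without their first element

full = np.corrcoef(A, B)   # rows of A and B are stacked: shape (2n, 2n)
lower = full[n:]           # rows belonging to B: shape (n, 2n)
pairwise = np.diagonal(lower)  # entry k is full[n + k, k], i.e. corr(B[k], A[k])

print(full.shape)
print(pairwise.shape)
```

Each entry of `pairwise` should equal `np.corrcoef(A[k], B[k])[0, 1]` for the matching row `k`; all other entries of the matrix are correlations between unrelated windows and are discarded.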

As the timings show, the second option is roughly 20 times faster (174 µs versus 3.66 ms), although the code and concepts are perhaps a little more abstract.
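If the full correlation matrix is a concern for longer series (its size grows with the square of the number of windows), one possible alternative is to compute the Pearson formula row by row on the window array. This is a sketch rather than a benchmarked replacement: `rolling_autocorr_vectorized` is a name invented here, it assumes NumPy 1.20 or later for sliding_window_view, and it assumes no NaN values and no constant windows (which would divide by zero).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_autocorr_vectorized(values, window=5, lag=1):
    """Row-wise lag autocorrelation of every full window (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    windows = sliding_window_view(values, window)   # shape (n_windows, window)
    head = windows[:, :window - lag]                # each window minus its last `lag` items
    tail = windows[:, lag:]                         # each window minus its first `lag` items
    head_c = head - head.mean(axis=1, keepdims=True)
    tail_c = tail - tail.mean(axis=1, keepdims=True)
    num = (head_c * tail_c).sum(axis=1)
    den = np.sqrt((head_c ** 2).sum(axis=1) * (tail_c ** 2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
data = rng.random(20)
result = rolling_autocorr_vectorized(data)
print(result.shape)
```

The output has one value per full window, in the same order as the loop version, so the two can be compared directly with np.allclose.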

Rawson