
I would like to perform autocorrelation on the signal shown below. The time between two consecutive points is 2.5 ms (i.e. a repetition rate of 400 Hz).

[plot: the signal to be analysed]

This is the equation for estimating autocorrelation that I would like to use (taken from http://en.wikipedia.org/wiki/Autocorrelation, section Estimation):

$$\hat{R}(k) = \frac{1}{(n-k)\sigma^2} \sum_{t=1}^{n-k} (X_t - \mu)(X_{t+k} - \mu)$$

What is the simplest method of finding the estimated autocorrelation of my data in Python? Is there something similar to numpy.correlate that I can use?

Or should I just calculate the mean and variance?
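For reference, a direct translation of that estimator into NumPy might look like this (a minimal sketch; autocorr_at_lag is a hypothetical name, and data is assumed to be a 1-D array):

import numpy as np

def autocorr_at_lag(data, k):
    # R(k) = 1/((n-k)*sigma^2) * sum_{t=1}^{n-k} (X_t - mu)(X_{t+k} - mu)
    n = len(data)
    mu = data.mean()
    sigma2 = data.var()
    return ((data[:n-k] - mu) * (data[k:] - mu)).sum() / ((n - k) * sigma2)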


Edit:

With help from unutbu, I have written:

import numpy as np
import pylab as P

fn = 'data.txt'
time, x = np.loadtxt(fn, unpack=True, usecols=[0, 1])

def estimated_autocorrelation(x):
    n = len(x)
    variance = x.var()
    x = x-x.mean()
    r = np.correlate(x, x, mode='full')[-n:]
    #assert np.allclose(r, np.array([(x[:n-k]*x[-(n-k):]).sum() for k in range(n)]))
    result = r/(variance*(np.arange(n, 0, -1)))
    return result

P.plot(time,estimated_autocorrelation(x))
P.xlabel('time (s)')
P.ylabel('autocorrelation')
P.show()
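Since the sampling interval is 2.5 ms, the lag axis can also be built directly instead of reusing the time column (a small sketch using the names above):

lags = np.arange(len(x)) * 2.5e-3    # lag in seconds, one step per 2.5 ms sample
P.plot(lags, estimated_autocorrelation(x))
P.xlabel('lag (s)')
P.ylabel('autocorrelation')
P.show()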
8765674
  • http://stackoverflow.com/questions/4503325/autocorrelation-of-a-multidimensional-array-in-numpy – Abid Rahman K Jan 12 '13 at 19:50
  • I wanted to speak more specifically about the estimated autocorrelation equation. – 8765674 Jan 12 '13 at 20:04
  • See also: http://stackoverflow.com/questions/643699/how-can-i-use-numpy-correlate-to-do-autocorrelation and http://stackoverflow.com/questions/12269834/is-there-any-numpy-autocorrellation-function-with-standardized-output – amcnabb Oct 10 '13 at 18:38

5 Answers


I don't think there is a NumPy function for this particular calculation. Here is how I would write it:

import numpy as np

def estimated_autocorrelation(x):
    """
    http://stackoverflow.com/q/14297012/190597
    http://en.wikipedia.org/wiki/Autocorrelation#Estimation
    """
    n = len(x)
    variance = x.var()
    x = x-x.mean()
    r = np.correlate(x, x, mode = 'full')[-n:]
    assert np.allclose(r, np.array([(x[:n-k]*x[-(n-k):]).sum() for k in range(n)]))
    result = r/(variance*(np.arange(n, 0, -1)))
    return result

The assert statement is there to both check the calculation and to document its intent.

When you are confident this function is behaving as expected, you can comment-out the assert statement, or run your script with python -O. (The -O flag tells Python to ignore assert statements.)
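For example, a quick check on synthetic data (a sketch; the sine-plus-noise signal is only an illustration, not the OP's data):

import numpy as np

t = np.arange(0, 1, 2.5e-3)                        # 400 Hz sampling, as in the question
x = np.sin(2*np.pi*5*t) + 0.5*np.random.randn(len(t))
r = estimated_autocorrelation(x)
print(r[0])                                        # lag 0 is exactly 1 with this normalisation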

unutbu
  • Thank you. I think this is the only way to calculate this estimate of autocorrelation. Can this easily be used to find autocorrelation of data loaded using `x = loadtxt('fn.txt',unpack=True,usecols=[0])` and plotted `pylab.plot(autoCorr, t)`? – 8765674 Jan 15 '13 at 15:19
  • Yes, something like that should work. Perhaps try `pylab.plot(x, estimate_autocorrelation(x))`... – unutbu Jan 15 '13 at 15:42
  • If the data is complex-valued, a conjugation should be added to the assertion `assert np.allclose(r, np.array([(x.conj()[:n-k]*x[-(n-k):]).sum() for k in range(n)]))` – icemtel Sep 15 '21 at 13:05

I took part of the code from the pandas autocorrelation_plot() function. I checked the results against R and the values match exactly.

import numpy
def acf(series):
    n = len(series)
    data = numpy.asarray(series)
    mean = numpy.mean(data)
    c0 = numpy.sum((data - mean) ** 2) / float(n)

    def r(h):
        acf_lag = ((data[:n - h] - mean) * (data[h:] - mean)).sum() / float(n) / c0
        return round(acf_lag, 3)
    x = numpy.arange(n) + 1  # lags 1..n, avoiding the trivial lag-0 value
    acf_coeffs = [r(h) for h in x]  # list comprehension (map is lazy in Python 3)
    return acf_coeffs
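Usage might look like this (a hedged sketch; the random-walk series is just a toy input):

series = numpy.cumsum(numpy.random.randn(500))   # toy series with strong autocorrelation
coeffs = acf(series)
print(coeffs[:5])                                # rounded coefficients for lags 1..5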
Kathirmani Sukumar

The statsmodels package provides an autocorrelation function that internally uses np.correlate (according to the statsmodels documentation).

See: http://statsmodels.sourceforge.net/stable/generated/statsmodels.tsa.stattools.acf.html#statsmodels.tsa.stattools.acf
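A minimal call might look like this (a sketch; nlags and fft are documented parameters of acf, and the input array here is only a placeholder):

import numpy as np
from statsmodels.tsa.stattools import acf

x = np.random.randn(1000)        # placeholder data
r = acf(x, nlags=40, fft=True)   # autocorrelation coefficients for lags 0..40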

hamogu

The method I wrote as of my latest edit is now faster than even statsmodels.tsa.stattools.acf with fft=True until the sample size gets very large.

Error analysis: if you want to adjust for biases and get highly accurate error estimates, look at my code here, which implements this paper by Ulli Wolff (or the original by UW in MATLAB).

Functions Tested

  • a = correlatedData(n=10000) is from a routine found here
  • gamma() is from the same place as correlatedData()
  • acorr() is my function below
  • estimated_autocorrelation is found in another answer
  • acf() is from statsmodels (`from statsmodels.tsa.stattools import acf`)

Timings

%timeit a0, junk, junk = gamma(a, f=0)                            # puwr.py
%timeit a1 = [acorr(a, m, i) for i in range(l)]                   # my own
%timeit a2 = acf(a)                                               # statstools
%timeit a3 = estimated_autocorrelation(a)                         # numpy
%timeit a4 = acf(a, fft=True)                                     # stats FFT

100 loops, best of 3: 7.18 ms per loop
100 loops, best of 3: 2.15 ms per loop
10 loops, best of 3: 88.3 ms per loop
10 loops, best of 3: 87.6 ms per loop
100 loops, best of 3: 3.33 ms per loop

Edit: I checked again, keeping l=40 and changing n=10000 to n=200000 samples. The FFT methods start to get a bit of traction, and the statsmodels FFT implementation just edges it (the order is the same as above):

10 loops, best of 3: 86.2 ms per loop
10 loops, best of 3: 69.5 ms per loop
1 loops, best of 3: 16.2 s per loop
1 loops, best of 3: 16.3 s per loop
10 loops, best of 3: 52.3 ms per loop

Edit 2: I changed my routine and re-tested vs. the FFT for n=10000 and n=200000.

a = correlatedData(n=200000); b=correlatedData(n=10000)
m = a.mean(); rng = np.arange(40); mb = b.mean()
%timeit a1 = [acorr(a, m, t) for t in rng]
%timeit a1 = [acorr(b, mb, t) for t in rng]
%timeit a4 = acf(a, fft=True)
%timeit a4 = acf(b, fft=True)

10 loops, best of 3: 73.3 ms per loop   # acorr below
100 loops, best of 3: 2.37 ms per loop  # acorr below
10 loops, best of 3: 79.2 ms per loop   # statstools with FFT
100 loops, best of 3: 2.69 ms per loop # statstools with FFT

Implementation

def acorr(op_samples, mean, separation, norm = 1):
    """autocorrelation of a measured operator with optional normalisation
    the autocorrelation is measured over the 0th axis

    Required Inputs
        op_samples  :: np.ndarray :: the operator samples
        mean        :: float :: the mean of the operator
        separation  :: int :: the separation between HMC steps
        norm        :: float :: the autocorrelation with separation=0
    """
    return ((op_samples[:op_samples.size-separation] - mean)
            * (op_samples[separation:] - mean)).ravel().mean() / norm

A 4x speedup can be achieved with the variant below. Be careful to pass only op_samples=a.copy(), as otherwise it will modify the array a via op_samples -= mean:

op_samples -= mean
return (op_samples[:op_samples.size-separation]*op_samples[separation:]).ravel().mean() / norm
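Putting it together, a sweep over the first 40 lags might look like this (a sketch; np.random.randn stands in for the correlatedData routine above, and the .copy() calls matter only for the in-place variant):

import numpy as np

a = np.random.randn(10000)            # stand-in for correlatedData(n=10000)
m = a.mean()
norm = acorr(a.copy(), m, 0)          # lag-0 autocovariance as the normalisation
r = [acorr(a.copy(), m, t, norm) for t in range(40)]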

Sanity Check

[plot: sanity check of the autocorrelation routines]

Example Error Analysis

This is a bit out of scope, but I can't be bothered to redo the figure without the integrated autocorrelation time or the integration-window calculation. The autocorrelations with errors are clear in the bottom plot. [plot: autocorrelations with error estimates]

Alexander McFarlane
  • Your data samples are far too small; how could this be faster than an FFT method for large data sizes? You're comparing n^2 to n log n here. – Kuhlambo Sep 27 '17 at 14:22
  • feel free to repeat - the code is all above. I do actually state the FFT routine becomes the faster method at very large `n` so I suspect you just scanned the first sentence. It is possible that the FFT method has a lot of python overhead and error checking each time it is called – Alexander McFarlane Sep 28 '17 at 09:29
  • Sorry, it's just misleading to state it like that; you're only calculating the correlation for 40 different time shifts. Normally, if you have a dataset of 200000 points, the part that is interesting performance-wise is to look at the correlation function over the whole time. In this case you're dealing with a number of operations on the order of 200000^2, while an FFT method would do around 5*200000. By the way, my naive implementation, which is basically identical to yours, takes 18.1 ms per loop on identical inputs... – Kuhlambo Sep 28 '17 at 09:52
  • Firstly, thanks for taking the time to look at this - I use this code a lot, so any improvements are more than welcome! The sample size decreases as you increase the correlation time, so measuring *all time* isn't that helpful due to the lack of samples at longer times. Can you clarify what method you're timing and benchmark it against another? (Different machines etc.) To clarify, I agree that FFT is faster at scale. However, for a lot of everyday cases where the sample `n<10000` and we look at a small autocorrelation time (typical of financial data), this method seems quite good? – Alexander McFarlane Sep 28 '17 at 10:17
  • I've edited the opening sentence as you're right that it implies my routine is quicker for all sample sizes, when it's just faster at a limit. It might be interesting to call the underlying c-func directly (avoiding the python overhead) from `statstools` and compare? – Alexander McFarlane Sep 28 '17 at 10:28
  • I don't really have an opinion on that. Two more things: first, by using .ravel() I am not sure that you achieve anything, and by adding a .mean() to it you divide by len(op_samples); just so you're aware, this is not the factor 1/(n-k) from the definition the OP gave. Secondly, if you replace (op_samples[:op_samples.size-separation]*op_samples[separation:]).ravel().mean() with np.dot(op_samples[:op_samples.size-separation], op_samples[separation:]), you get further speedups; as far as I can see, that was the difference from my naive implementation of autocorrelation. – Kuhlambo Sep 28 '17 at 10:47
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155498/discussion-between-pindakaas-and-alexander-mcfarlane). – Kuhlambo Sep 28 '17 at 10:53
  • ^ will respond on chat tonight - corp firewall blocks chat I'm afraid – Alexander McFarlane Sep 28 '17 at 10:54
  • Yep, that was it. Just compare `a = linspace(0,100,1000000)`: `%timeit np.mean(a*a)` to `%timeit np.dot(a,a)/len(a)`; on my machine, 44.4 ms vs 8.07 ms. – Kuhlambo Sep 28 '17 at 11:04

I found this got the expected results with just a slight change:

import numpy as np

def estimated_autocorrelation(x):
    n = len(x)
    variance = x.var()
    x = x - x.mean()
    r = np.correlate(x, x, mode='full')
    result = r/(variance*n)
    return result

This was tested against Excel's autocorrelation results.
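Note that with mode='full' the returned array has 2n-1 entries, covering negative as well as positive lags; the positive-lag half can be extracted like this (a small sketch, reusing the function above):

r_full = estimated_autocorrelation(x)
r_pos = r_full[len(x)-1:]    # lags 0 .. n-1; the full array is symmetric about its centre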