
I would like to perform autocorrelation on the signal shown below. The time between two consecutive points is 2.5 ms (i.e. a repetition rate of 400 Hz).

[plot: the signal to be analysed]

This is the equation for estimating autocorrelation that I would like to use (taken from http://en.wikipedia.org/wiki/Autocorrelation, section Estimation):

$$\hat{R}(k) = \frac{1}{(n-k)\sigma^2} \sum_{t=1}^{n-k} (X_t - \mu)(X_{t+k} - \mu)$$

What is the simplest method of finding the estimated autocorrelation of my data in Python? Is there something similar to numpy.correlate that I can use?

Or should I just calculate the mean and variance?
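For reference, a direct translation of that estimator into NumPy might look like this (a minimal sketch; autocorr_at_lag is a hypothetical name, and data is assumed to be a 1-D array):

import numpy as np

def autocorr_at_lag(data, k):
    # R(k) = 1/((n-k)*sigma^2) * sum_{t=1}^{n-k} (X_t - mu)(X_{t+k} - mu)
    n = len(data)
    mu = data.mean()
    sigma2 = data.var()
    return ((data[:n-k] - mu) * (data[k:] - mu)).sum() / ((n - k) * sigma2)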


Edit:

With help from unutbu, I have written:

import numpy as np
import pylab as P

fn = 'data.txt'
time, x = np.loadtxt(fn, unpack=True, usecols=[0, 1])

def estimated_autocorrelation(x):
    n = len(x)
    variance = x.var()
    x = x-x.mean()
    r = np.correlate(x, x, mode='full')[-n:]
    #assert np.allclose(r, np.array([(x[:n-k]*x[-(n-k):]).sum() for k in range(n)]))
    result = r/(variance*(np.arange(n, 0, -1)))
    return result

P.plot(time,estimated_autocorrelation(x))
P.xlabel('time (s)')
P.ylabel('autocorrelation')
P.show()
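Since the sampling interval is 2.5 ms, the lag axis can also be built directly instead of reusing the time column (a small sketch using the names above):

lags = np.arange(len(x)) * 2.5e-3    # lag in seconds, one step per 2.5 ms sample
P.plot(lags, estimated_autocorrelation(x))
P.xlabel('lag (s)')
P.ylabel('autocorrelation')
P.show()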
8765674
  • http://stackoverflow.com/questions/4503325/autocorrelation-of-a-multidimensional-array-in-numpy – Abid Rahman K Jan 12 '13 at 19:50
  • I wanted to speak more specifically about the estimated autocorrelation equation. – 8765674 Jan 12 '13 at 20:04
  • See also: http://stackoverflow.com/questions/643699/how-can-i-use-numpy-correlate-to-do-autocorrelation and http://stackoverflow.com/questions/12269834/is-there-any-numpy-autocorrellation-function-with-standardized-output – amcnabb Oct 10 '13 at 18:38

5 Answers


I don't think there is a NumPy function for this particular calculation. Here is how I would write it:

import numpy as np

def estimated_autocorrelation(x):
    """
    http://stackoverflow.com/q/14297012/190597
    http://en.wikipedia.org/wiki/Autocorrelation#Estimation
    """
    n = len(x)
    variance = x.var()
    x = x-x.mean()
    r = np.correlate(x, x, mode = 'full')[-n:]
    assert np.allclose(r, np.array([(x[:n-k]*x[-(n-k):]).sum() for k in range(n)]))
    result = r/(variance*(np.arange(n, 0, -1)))
    return result

The assert statement is there to both check the calculation and to document its intent.

When you are confident this function is behaving as expected, you can comment-out the assert statement, or run your script with python -O. (The -O flag tells Python to ignore assert statements.)
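For example, a quick check on synthetic data (a sketch; the sine-plus-noise signal is only an illustration, not the OP's data):

import numpy as np

t = np.arange(0, 1, 2.5e-3)                        # 400 Hz sampling, as in the question
x = np.sin(2*np.pi*5*t) + 0.5*np.random.randn(len(t))
r = estimated_autocorrelation(x)
print(r[0])                                        # lag 0 is exactly 1 with this normalisation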

unutbu
  • Thank you. I think this is the only way to calculate this estimate of autocorrelation. Can this easily be used to find autocorrelation of data loaded using `x = loadtxt('fn.txt',unpack=True,usecols=[0])` and plotted `pylab.plot(autoCorr, t)`? – 8765674 Jan 15 '13 at 15:19
  • Yes, something like that should work. Perhaps try `pylab.plot(x, estimate_autocorrelation(x))`... – unutbu Jan 15 '13 at 15:42
  • If the data is complex-valued, a conjugation should be added to the assertion `assert np.allclose(r, np.array([(x.conj()[:n-k]*x[-(n-k):]).sum() for k in range(n)]))` – icemtel Sep 15 '21 at 13:05

I took part of the code from the pandas autocorrelation_plot() function. I checked the results against R and the values match exactly.

import numpy
def acf(series):
    n = len(series)
    data = numpy.asarray(series)
    mean = numpy.mean(data)
    c0 = numpy.sum((data - mean) ** 2) / float(n)

    def r(h):
        acf_lag = ((data[:n - h] - mean) * (data[h:] - mean)).sum() / float(n) / c0
        return round(acf_lag, 3)
    x = numpy.arange(n) + 1  # lags 1..n, avoiding the trivial lag-0 value
    acf_coeffs = [r(h) for h in x]  # list comprehension (map is lazy in Python 3)
    return acf_coeffs
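Usage might look like this (a hedged sketch; the random-walk series is just a toy input):

series = numpy.cumsum(numpy.random.randn(500))   # toy series with strong autocorrelation
coeffs = acf(series)
print(coeffs[:5])                                # rounded coefficients for lags 1..5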
Kathirmani Sukumar

The statsmodels package provides an autocorrelation function that internally uses np.correlate (according to the statsmodels documentation).

See: http://statsmodels.sourceforge.net/stable/generated/statsmodels.tsa.stattools.acf.html#statsmodels.tsa.stattools.acf
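A minimal call might look like this (a sketch; nlags and fft are documented parameters of acf, and the input array here is only a placeholder):

import numpy as np
from statsmodels.tsa.stattools import acf

x = np.random.randn(1000)        # placeholder data
r = acf(x, nlags=40, fft=True)   # autocorrelation coefficients for lags 0..40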

hamogu

The method I wrote as of my latest edit is now faster than even statsmodels.tsa.stattools.acf with fft=True until the sample size gets very large.

Error analysis: if you want to adjust for biases and get highly accurate error estimates, look at my code here, which implements this paper by Ulli Wolff (or the original by UW in MATLAB).

Functions Tested

  • a = correlatedData(n=10000) is from a routine found here
  • gamma() is from the same place as correlatedData()
  • acorr() is my function below
  • estimated_autocorrelation is found in another answer
  • acf() is from statsmodels (`from statsmodels.tsa.stattools import acf`)

Timings

%timeit a0, junk, junk = gamma(a, f=0)                            # puwr.py
%timeit a1 = [acorr(a, m, i) for i in range(l)]                   # my own
%timeit a2 = acf(a)                                               # statstools
%timeit a3 = estimated_autocorrelation(a)                         # numpy
%timeit a4 = acf(a, fft=True)                                     # stats FFT

100 loops, best of 3: 7.18 ms per loop
100 loops, best of 3: 2.15 ms per loop
10 loops, best of 3: 88.3 ms per loop
10 loops, best of 3: 87.6 ms per loop
100 loops, best of 3: 3.33 ms per loop

Edit: I checked again, keeping l=40 and changing n=10000 to n=200000 samples. The FFT methods start to get a bit of traction, and the statsmodels FFT implementation just edges it (the order is the same as above):

10 loops, best of 3: 86.2 ms per loop
10 loops, best of 3: 69.5 ms per loop
1 loops, best of 3: 16.2 s per loop
1 loops, best of 3: 16.3 s per loop
10 loops, best of 3: 52.3 ms per loop

Edit 2: I changed my routine and re-tested vs. the FFT for n=10000 and n=200000.

a = correlatedData(n=200000); b=correlatedData(n=10000)
m = a.mean(); rng = np.arange(40); mb = b.mean()
%timeit a1 = [acorr(a, m, t) for t in rng]
%timeit a1 = [acorr(b, mb, t) for t in rng]
%timeit a4 = acf(a, fft=True)
%timeit a4 = acf(b, fft=True)

10 loops, best of 3: 73.3 ms per loop   # acorr below
100 loops, best of 3: 2.37 ms per loop  # acorr below
10 loops, best of 3: 79.2 ms per loop   # statstools with FFT
100 loops, best of 3: 2.69 ms per loop # statstools with FFT

Implementation

def acorr(op_samples, mean, separation, norm = 1):
    """autocorrelation of a measured operator with optional normalisation
    the autocorrelation is measured over the 0th axis

    Required Inputs
        op_samples  :: np.ndarray :: the operator samples
        mean        :: float :: the mean of the operator
        separation  :: int :: the separation between HMC steps
        norm        :: float :: the autocorrelation with separation=0
    """
    return ((op_samples[:op_samples.size-separation] - mean)
            * (op_samples[separation:] - mean)).ravel().mean() / norm

A 4x speedup can be achieved with the variant below. Be careful to pass only op_samples=a.copy(), as otherwise it will modify the array a via op_samples -= mean:

op_samples -= mean
return (op_samples[:op_samples.size-separation]*op_samples[separation:]).ravel().mean() / norm
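Putting it together, a sweep over the first 40 lags might look like this (a sketch; np.random.randn stands in for the correlatedData routine above, and the .copy() calls matter only for the in-place variant):

import numpy as np

a = np.random.randn(10000)            # stand-in for correlatedData(n=10000)
m = a.mean()
norm = acorr(a.copy(), m, 0)          # lag-0 autocovariance as the normalisation
r = [acorr(a.copy(), m, t, norm) for t in range(40)]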

Sanity Check

[plot: sanity check of the autocorrelation routines]

Example Error Analysis

This is a bit out of scope, but I can't be bothered to redo the figure without the integrated autocorrelation time or the integration-window calculation. The autocorrelations with errors are clear in the bottom plot. [plot: autocorrelations with error estimates]

Alexander McFarlane
  • Your data samples are far too small; how could this be faster than an FFT method for large data sizes? You're comparing n^2 to n log n here. – Kuhlambo Sep 27 '17 at 14:22
  • feel free to repeat - the code is all above. I do actually state the FFT routine becomes the faster method at very large `n` so I suspect you just scanned the first sentence. It is possible that the FFT method has a lot of python overhead and error checking each time it is called – Alexander McFarlane Sep 28 '17 at 09:29
  • Sorry, it's just misleading to state it like that; you're only calculating the correlation for 40 different time shifts. Normally, if you have a dataset of 200000 points, the part that is interesting performance-wise is to look at the correlation function over the whole time. In this case you're dealing with a number of operations on the order of 200000^2, while an FFT method would do around 5*200000. By the way, my naive implementation, which is basically identical to yours, takes 18.1 ms per loop on identical inputs... – Kuhlambo Sep 28 '17 at 09:52
  • Firstly, thanks for taking the time to look at this - I use this code a lot, so any improvements are more than welcome! The sample size decreases as you increase the correlation time, so measuring *all time* isn't that helpful due to the lack of samples at longer times. Can you clarify what method you're timing and benchmark it against another? (Different machines etc.) To clarify, I agree that FFT is faster at scale. However, for a lot of everyday cases where the sample `n<10000` and we look at a small autocorrelation time (typical of financial data), this method seems quite good? – Alexander McFarlane Sep 28 '17 at 10:17
  • I've edited the opening sentence as you're right that it implies my routine is quicker for all sample sizes, when it's just faster at a limit. It might be interesting to call the underlying c-func directly (avoiding the python overhead) from `statstools` and compare? – Alexander McFarlane Sep 28 '17 at 10:28
  • I don't really have an opinion on that. Two more things: first, by using .ravel() I am not sure that you achieve anything, and by adding a .mean() to it you divide by len(op_samples); just so you're aware, this is not the factor 1/(n-k) from the definition the OP gave. Secondly, if you replace (op_samples[:op_samples.size-separation]*op_samples[separation:]).ravel().mean() with np.dot(op_samples[:op_samples.size-separation], op_samples[separation:]), you get further speedups; as far as I can see, that was the difference from my naive implementation of autocorrelation. – Kuhlambo Sep 28 '17 at 10:47
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155498/discussion-between-pindakaas-and-alexander-mcfarlane). – Kuhlambo Sep 28 '17 at 10:53
  • ^ will respond on chat tonight - corp firewall blocks chat I'm afraid – Alexander McFarlane Sep 28 '17 at 10:54
  • Yep, that was it. Just compare `a = linspace(0,100,1000000)`: `%timeit np.mean(a*a)` to `%timeit np.dot(a,a)/len(a)`; on my machine, 44.4 ms vs 8.07 ms. – Kuhlambo Sep 28 '17 at 11:04

I found this got the expected results with just a slight change:

import numpy as np

def estimated_autocorrelation(x):
    n = len(x)
    variance = x.var()
    x = x - x.mean()
    r = np.correlate(x, x, mode='full')
    result = r/(variance*n)
    return result

This was tested against Excel's autocorrelation results.
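Note that with mode='full' the returned array has 2n-1 entries, covering negative as well as positive lags; the positive-lag half can be extracted like this (a small sketch, reusing the function above):

r_full = estimated_autocorrelation(x)
r_pos = r_full[len(x)-1:]    # lags 0 .. n-1; the full array is symmetric about its centre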