
Say the two series are:

x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]

Series y clearly lags x by 12 time periods (the bump starts at index 4 in x and at index 16 in y). However, using the following code, as suggested in "Python cross correlation":

import numpy as np
c = np.correlate(x, y, "full")
lag = np.argmax(c) - c.size/2

leads to an incorrect lag of -0.5.
What's wrong here?

1 Answer


The easy way to do this is to use scipy.signal.correlation_lags.

Also, remember to subtract the mean from the inputs.

import numpy as np
from scipy import signal

x = np.array([4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4])
y = np.array([4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4])
# Remove the means, then cross-correlate over every possible shift
correlation = signal.correlate(x - np.mean(x), y - np.mean(y), mode="full")
# Lag (in samples) that corresponds to each entry of correlation
lags = signal.correlation_lags(len(x), len(y), mode="full")
lag = lags[np.argmax(abs(correlation))]

This gives lag=-12, which is the difference between the index of the first 6 in x (index 4) and in y (index 16). If you swap the inputs, it gives +12.
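For reference, the same result can be reproduced with plain NumPy, which also shows what went wrong in the question. This is only a sketch for this example: the lags array is built by hand (it is not a NumPy function), on the assumption that in "full" mode zero lag sits at index len(y) - 1 rather than at c.size/2.

```python
import numpy as np

x = np.array([4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4], dtype=float)
y = np.array([4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4], dtype=float)

# Cross-correlate the mean-subtracted series over every possible shift
c = np.correlate(x - x.mean(), y - y.mean(), mode="full")

# Hand-built lag axis (assumption: zero lag is at index len(y) - 1)
lags = np.arange(-(len(y) - 1), len(x))

lag = lags[np.argmax(np.abs(c))]
print(lag)
```

With 23 samples per series the output has 45 entries and zero lag is at index 22, so subtracting c.size/2 = 22.5 can never yield an integer lag.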

Edit

Why subtract the mean

If the signals have a non-zero mean, the terms near the center of the correlation become larger, because more overlapping samples contribute to the sum there. Furthermore, for very large data, subtracting the mean makes the calculations more accurate.
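The effect of the overlap can be seen in isolation with a small sketch. It assumes two equal-length series of 23 samples and uses np.correlate; the constant value 4.0 is just the baseline of the example series.

```python
import numpy as np

n = 23

# Correlating two all-ones series counts how many sample pairs
# overlap at each shift: 1, 2, ..., n, ..., 2, 1 (a triangle)
overlap = np.correlate(np.ones(n), np.ones(n), mode="full")

# A constant (non-zero mean) signal therefore produces a triangular
# correlation that peaks at the center, regardless of any real lag
const = np.correlate(np.full(n, 4.0), np.full(n, 4.0), mode="full")

print(overlap[:3], overlap[n - 1])
print(np.argmax(const))
```

Any non-zero mean adds a triangle like this on top of the part of the correlation that carries the lag information.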

Here I illustrate what would happen if the mean was not subtracted for this example.

import matplotlib.pyplot as plt

plt.plot(abs(correlation))
plt.plot(abs(signal.correlate(x, y, mode="full")))
plt.plot(abs(signal.correlate(np.ones_like(x)*np.mean(x), np.ones_like(y)*np.mean(y))))
# Labels follow the order of the plot calls above
plt.legend(['subtracting mean', 'keeping the mean', 'constant signal'])
plt.show()


Notice that the maximum of the blue curve (mean subtracted, at index 10) does not coincide with the maximum of the orange curve (mean kept).
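The two peak positions can also be checked numerically without a plot. This sketch uses np.correlate instead of signal.correlate (an assumption that the two agree for these real-valued inputs); the variable names are illustrative.

```python
import numpy as np

x = np.array([4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4], dtype=float)
y = np.array([4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4], dtype=float)

# Correlation with and without removing the mean first
mean_removed = np.abs(np.correlate(x - x.mean(), y - y.mean(), mode="full"))
mean_kept = np.abs(np.correlate(x, y, mode="full"))

peak_removed = np.argmax(mean_removed)  # index of the blue curve's maximum
peak_kept = np.argmax(mean_kept)        # index of the orange curve's maximum

# Convert indices to lags: zero lag is at index len(y) - 1
print(peak_removed, peak_removed - (len(y) - 1))
print(peak_kept, peak_kept - (len(y) - 1))
```

With the mean kept, the peak lands on the center of the array (lag 0), which is why the code in the question reports no meaningful lag.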

Bob
  • Why do you need to subtract the mean when calculating the correlation? – Py-ser May 13 '22 at 13:19
  • If the two signals have the same length, the number of terms summed at each lag has a triangular shape, which will probably place the maximum correlation at the center. – Bob May 15 '22 at 16:18
  • Added one plot to help there. – Bob May 15 '22 at 16:27
  • Thank you. You say 'the terms at the center of the correlation will become larger'. Why is this not reported in any official documentation? Do you have an official link that elaborates more on that and the use of the mean? – Py-ser May 16 '22 at 08:17
  • They give the definition in the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html#:~:text=The%20correlation%20z%20of%20two%20d%2Ddimensional%20arrays%20x%20and%20y%20is%20defined%20as). I use this to calculate an unnormalized version of the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for all the possible shifts. – Bob May 16 '22 at 09:13
  • Sure, I just can't get from the definition how the correlation becomes stronger towards the center of the array, nor is there any mention of the mean subtraction. – Py-ser May 16 '22 at 09:22
  • So maybe you need to post a specific question for your specific doubts. I will be happy to give an answer with more details if I can. – Bob May 16 '22 at 09:37
  • I have done it [here](https://stackoverflow.com/questions/72230482/signal-correlation-shift-and-lag-correct-only-if-arrays-subtracted-by-mean) following this thread. – Py-ser May 16 '22 at 09:39