Cross-correlation of non-periodic function with NumPy

Question

I have two data sets that I'm trying to cross-correlate. They look similar to the arctan function, so I've been using it as a model to work out how to do my signal processing.

x = linspace(-15, 15, 2**13)
f1 = arctan(x)
f2 = arctan(x + 2)

enter image description here

The question I need to answer is, how much do I need to shift the green signal to get it to (mostly) overlap with the blue one? I thought it would be as simple as finding the maximum in the cross-correlation function of f1 and f2, and I broadly followed the advice here: How to correlate two time series with gaps and different time bases?. This is what I've been trying

c = correlate(f1, f2, 'full')
s = arange(1-2**13, 2**13)
dx = 30/2**13
shift = s[c.argmax()]*dx

I would expect shift to equal more or less exactly 2, but in fact it's only 0.234. This doesn't make any sense to me; I've found the x-coordinate of the maximum in cross-correlation, which should be found where there is maximal overlap of the two signals.

Any ideas on how to calculate this quantity for this kind of function?

EDIT: I should add, for my real data, all of the values will be between zero and one

EDIT EDIT: The following functions are actually more like my real data:

x = linspace(-15, 15, 400)
f1 = (arctan(-x) + pi/2) / pi
f2 = (arctan(-x + 2) + pi/2) / pi

enter image description here

So using the formula given here: http://paulbourke.net/miscellaneous/correlate/ I can write a cross-correlation function that pads the data to add ones to the left and zeros to the right:

def xcorr(x, y);
    mx = x.mean()
    my = y.mean()
    sx = x.std()
    sy = y.std()
    r = zeros(2*len(x))

    for d in range(-len(x), len(x)):
        csum = 0
        for i in range(0, len(x):
            yindx = i - d
            if i - d < 0:
                yval = 1
            elif i - d >= len(x):
                yval = 0
            else:
                yval = y[yindx]
            csum += (x[i] - mx) * (yval - my)
        r[d + len(x)] = csum / (sx * sy)
    return r

With this function, I can now do

c = xcorr(f1, f2)
s = arange(-400, 400)
dx = 30/400
shift = s[c.argmax()] * dx

Which comes out to 2.025, which is as close as you can get to 2 with this precision. So it looks like Jamie was correct, the issue is in how numpy correlate does the padding of signals.

So, obviously my xcorr function is really slow as it stands. The question now is, is there a way I can make NumPy do a similar thing, or should I just proceed to write my algorithm in C using ctypes?

Interesting- if I add 1.5 to f1 and f2 to make them always positive, the shift becomes zero... — KJ Tsanaktsidis, Apr 13 '13 at 15:45
@Eike then you should make your comment an answer. Maybe add a link for further explainations. — Stephane Rolland, Apr 13 '13 at 16:08
Unfortunately, a fast check at http://en.wikipedia.org/wiki/Cross-correlation doesn't bring up any specification about absolute value neede for the input signals. I don't know if the numpy.correlate function has such limitations. — Stephane Rolland, Apr 13 '13 at 16:27
They even give details about the effect of positive and negative areas in the signal in the Explanation paragraph : **"This is because when peaks (positive areas) are aligned, they make a large contribution to the integral. Similarly, when troughs (negative areas) align, they also make a positive contribution to the integral because the product of two negative numbers is positive."** — Stephane Rolland, Apr 13 '13 at 16:29
@StephaneRolland It is padding that is ruining the OP's efforts, not a non-exisitng requirement for positiveness. — Jaime, Apr 13 '13 at 16:43

score 3 · Answer 1 · answered Apr 14 '13 at 04:52

As people have pointed out, the cross-correlation is confounded by the padding beyond the data.

Although it feels like you're throwing away good data, it's often better to just trim the data set so that the correlation can be done without assumptions (at least when compared to the alternative of correlating actual data with made up data for the padding).

x = linspace(-15, 15, 4000)
f1 = (arctan(-x) + pi/2) / pi
f2 = (arctan(-x + 2) + pi/2) / pi

L4 = int(len(f2)/8)
sf2 = f2[L4:-L4]

c = correlate(f1-mean(f1), sf2-mean(f1), 'same')
print "peak correlation occurs at:", x[argmax(c)]  # -2.02925731433

plt.plot(x, c)
plt.show()

enter image description here

Also, though, I'm not sure that xcorr is the best approach here. How about just summing the distance between the y-axis values for different shifts and taking the minimum, which gets away from all the issues of where the zero is, etc.

score 1 · Answer 2 · answered Apr 13 '13 at 16:47

What the maximum of the cross-correlation finds is the shift at which the sum of the products of your two signals is a maximum. One would think that the cross correlation of two signals, one a time shift of the other, would be maximum at the shift. And while this is true for infinite signals, numpy zero-pads your two signals, so with two equally long signals, you only have a summation over 2**13 non-zero values for a shift of zero, and the higher values from a better matching of the two functions, are offset but there being less non-zero values.

If your signals were 0 at +/- infinity, this wouldn´t be too bad. As it is, I haven't been able to come up with any reasonable solution using cross-correlation.

OK, if I calculate the cross-correlation manually, I can do whatever I like with my padding. I'll amend my question to include a function that's a bit more like my data and an algorithm to do the padding I want manually... — KJ Tsanaktsidis, Apr 14 '13 at 00:16

Cross-correlation of non-periodic function with NumPy

2 Answers2