
I'm struggling to come up with a clear, efficient way of using numpy to compute a self-correlation (autocorrelation) function of a set of 3D vectors.

I have a set of vectors in 3D space, saved in an array:

a = array([[ 0.24463039,  0.58350592,  0.77438803],
       [ 0.30475903,  0.73007075,  0.61165238],
       [ 0.17605543,  0.70955876,  0.68229821],
       [ 0.32425896,  0.57572195,  0.7506    ],
       [ 0.24341381,  0.50183697,  0.83000565],
       [ 0.38364726,  0.62338687,  0.68132488]])

Their self-correlation function is defined as

C(t, \{v\}_n) = \frac{1}{n-t} \sum_{i=0}^{n-1-t} \vec{v}_i \cdot \vec{v}_{i+t}

I'm struggling to code this up in an efficient, non-obfuscating way. I can compute it explicitly with two nested for loops, but that's slow. There is a fast way of doing it with one of numpy's built-in functions, but those seem to use an entirely different definition of correlation function. A similar problem has been solved here, How can I use numpy.correlate to do autocorrelation?, but that doesn't handle vectors. How can I solve this?
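For reference, this is a minimal sketch of the explicit double-loop version I mean (the function name is just illustrative); it follows the definition above directly but is slow:

import numpy as np

def naive_vector_autocorrelate(t_array):
    n = len(t_array)
    acorr = np.zeros(n)
    for t in range(n):
        # C(t) = 1/(n-t) * sum_{i=0}^{n-1-t} v_i . v_{i+t}
        for i in range(n - t):
            acorr[t] += np.dot(t_array[i], t_array[i + t])
        acorr[t] /= (n - t)
    return acorr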

Ferdinando Randisi

4 Answers


The NumPy routines are for 1D arrays. As a "minimal" improvement, use a vectorized operation for the normalisation step (the np.arange call in the second-to-last line):

import numpy as np

def vector_autocorrelate(t_array):
  n_vectors = len(t_array)
  # correlate each component independently
  acorr = np.array([np.correlate(t_array[:,i], t_array[:,i], 'full') for i in range(3)])[:, n_vectors-1:]
  # sum the correlations for each component
  acorr = np.sum(acorr, axis=0)
  # divide by the number of values actually measured and return
  acorr /= (n_vectors - np.arange(n_vectors))
  return acorr

For larger array sizes, you should consider using the Fourier Transform algorithm for correlation. Check out the examples of the library tidynamics if you are interested in that (disclaimer: I wrote the library, it depends only on NumPy).
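As a sketch of that idea (this is not the actual tidynamics code, just the standard zero-padded FFT trick applied per component):

import numpy as np

def fft_vector_autocorrelate(t_array):
    n = len(t_array)
    # zero-pad to avoid circular wrap-around in the FFT-based correlation
    nfft = 2 * n
    acorr = np.zeros(n)
    for i in range(t_array.shape[1]):
        f = np.fft.rfft(t_array[:, i], nfft)
        # back-transforming |f|^2 gives the autocorrelation; keep lags 0..n-1
        acorr += np.fft.irfft(f * np.conj(f), nfft)[:n]
    # divide by the number of terms actually summed for each lag
    return acorr / (n - np.arange(n))

This is the N log N behaviour that shows up in the timings below.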

For reference, here are the timings for a plain NumPy version (that I wrote for testing), your code vector_autocorrelate, and tidynamics.

import time
import numpy as np
import tidynamics

size = [2**i for i in range(4, 17, 2)]

np_time = []
ti_time = []
va_time = []
for s in size:
    data = np.random.random(size=(s, 3))
    t0 = time.time()
    correlate = np.array([np.correlate(data[:,i], data[:,i], mode='full') for i in range(data.shape[1])])[:, s-1:]
    correlate = np.sum(correlate, axis=0)/(s-np.arange(s))
    np_time.append(time.time()-t0)
    t0 = time.time()
    correlate = tidynamics.acf(data)
    ti_time.append(time.time()-t0)
    t0 = time.time()
    correlate = vector_autocorrelate(data)
    va_time.append(time.time()-t0)

You can see the results:

print("size", size)
print("np_time", np_time)
print("va_time", va_time)
print("ti_time", ti_time)

size [16, 64, 256, 1024, 4096, 16384, 65536]

np_time [0.00023794174194335938, 0.0002703666687011719, 0.0002713203430175781, 0.001544952392578125, 0.0278470516204834, 0.36094141006469727, 6.922360420227051]

va_time [0.00021696090698242188, 0.0001690387725830078, 0.000339508056640625, 0.0014629364013671875, 0.024930953979492188, 0.34442687034606934, 7.005630731582642]

ti_time [0.0011148452758789062, 0.0008449554443359375, 0.0007512569427490234, 0.0010488033294677734, 0.0026645660400390625, 0.007939338684082031, 0.048232316970825195]

or plot them:

import matplotlib.pyplot as plt

plt.plot(size, np_time)
plt.plot(size, va_time)
plt.plot(size, ti_time)
plt.loglog()

For anything but very small data series, the "N**2" algorithm is unusable.

Pierre de Buyl
  • Thanks! I'm pretty sure that numpy does use the Fourier transform method. Could you benchmark the solutions with and without tidynamics to see how fast they are in comparison with each other? Also, I did like the vectorisation in the normalisation, it's quite sleek. – Ferdinando Randisi Feb 18 '18 at 12:43
  • Nothing in the docs of NumPy suggests that they use the FFT. The source code uses plain loops, and that scales as N**2. I added the timings. – Pierre de Buyl Feb 18 '18 at 21:20
  • Agreed, FFT is the right way to go for this problem. – Robert Dodier Feb 18 '18 at 21:44
  • Cool stuff! Thanks much :) – Ferdinando Randisi Feb 19 '18 at 22:59
  • I would love to, but currently the fast correlation vector function doesn't do what the question asks. tidynamics.acf(data) seems to only compute the correlation of each line, as opposed to compute the correlation function of the vectors as defined. That would answer the question as it was formulated and the answer would be more than worthy to be accepted. – Ferdinando Randisi Feb 20 '18 at 13:33
  • "For D-dimensional time series, a sum is performed on the last dimension." from the doc for [acf](http://lab.pdebuyl.be/tidynamics/correlation.html#tidynamics.acf). Is it not what you ask? – Pierre de Buyl Feb 20 '18 at 15:14

Here's my result. It is slower (about 4x) and delivers different results. Why do I post it anyway? I thought it was worth seeing how to measure and what the difference is. If, in addition, anybody finds the reason for the different results, I'd be more than happy.

So, here's my solution:

%timeit [np.mean([np.dot(a[t], a[i+t]) for i in range(len(a)-t)]) for t in range(len(a))]

Result: 95.2 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In comparison, your solution is about 4x faster:

%timeit vector_autocorrelate(a)

delivers: 24.8 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

jboi
  • Mine is faster because it piggybacks on numpy's correlate function, which uses the Fourier transform to make computing the correlation an O(N) operation, while the 2 nested for loops make computing the entire function an O(N^2) operation. The discrepancy might come from the fact that each value of t has a different number of values to average over, and mean might not take that into account. EDIT: actually np.mean seems fine, I don't know what the problem is. But I checked my implementation against another one similar to yours, and it did work, so I'm puzzled. – Ferdinando Randisi Feb 18 '18 at 12:36

Hi, I faced a similar issue. Here is my idea:

import numpy as np

def fast_vector_correlation(M):
    n_row = M.shape[0]
    # all pairwise scalar products between the row vectors
    dot_mat = M.dot(M.T)
    # the x-th upper diagonal sums v_i . v_{i+x} over all valid i
    corr = np.array([np.trace(dot_mat, offset=x) for x in range(n_row)])
    # divide by the number of terms in each sum
    corr = corr / (n_row - np.arange(n_row))
    return corr

The idea is that dot_mat contains all the scalar products between the row vectors. To compute the correlation at different t values you just have to sum the diagonals of the upper triangular part.
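As a quick sanity check (a sketch reusing the array a from the question), the result matches the explicit double-loop definition:

a = np.array([[0.24463039, 0.58350592, 0.77438803],
              [0.30475903, 0.73007075, 0.61165238],
              [0.17605543, 0.70955876, 0.68229821],
              [0.32425896, 0.57572195, 0.7506    ],
              [0.24341381, 0.50183697, 0.83000565],
              [0.38364726, 0.62338687, 0.68132488]])

explicit = [np.mean([np.dot(a[i], a[i + t]) for i in range(len(a) - t)])
            for t in range(len(a))]
print(np.allclose(fast_vector_correlation(a), explicit))  # True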


I'm posting the answer here in case someone else needs it, since it took me quite some time to figure out a viable approach. I ended up solving this by defining the following function:

import numpy as np

def vector_autocorrelate(t_array):
  n_vectors = len(t_array)
  # correlate each component independently
  acorr = np.array([np.correlate(t_array[:,i], t_array[:,i], 'full') for i in range(3)])[:, n_vectors-1:]
  # sum the correlations for each component
  acorr = np.sum(acorr, axis=0)
  # divide by the number of values actually measured and return
  acorr = np.array([val / (n_vectors - i) for i, val in enumerate(acorr)])
  return acorr

If anybody has a better idea, I would really like to hear it, since I think the current one is still not as compact as it should be. It's better than nothing though, which is why I'm posting it here.

Ferdinando Randisi
  • Another question: I tried to find a faster solution and found a one-liner that is slower (well...). It also delivers different results. I'm pretty sure I'm correct. Can you please double-check your results? – jboi Feb 17 '18 at 22:28