
I am writing NumPy code to calculate autocorrelation, and I am trying to improve the performance of my implementation.

I have tried two approaches: matrix multiplication on an array of views, and dot products on array slices in a for loop. To my surprise, the first approach seems much slower.

The function takes a vector x and maximum shift k, and returns the dot product of the vector with a shifted vector for every shift i.

import numpy as np

def acorr_aview(x, k):
    return np.dot([x[i:-k+i] for i in range(k)], x[:-k])

def acorr_loop(x, k):
    return np.array([np.dot(x[i:-k+i], x[:-k]) for i in range(k)])
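
For reference, a quick standalone check (repeating the two definitions so the snippet runs on its own) confirms the two functions compute the same k-vector of lagged dot products:

```python
import numpy as np

def acorr_aview(x, k):
    # build a (k, len(x)-k) array from the shifted views, then one matrix-vector product
    return np.dot([x[i:-k+i] for i in range(k)], x[:-k])

def acorr_loop(x, k):
    # one small dot product per shift
    return np.array([np.dot(x[i:-k+i], x[:-k]) for i in range(k)])

x = np.random.randn(10000)
k = 100
assert np.allclose(acorr_aview(x, k), acorr_loop(x, k))
```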

I was expecting acorr_aview to have better performance because it uses matrix multiplication, but the opposite seems to be the case.

x = np.random.randn(10000)
k = 100

%timeit acorr_aview(x,k)
3.32 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit acorr_loop(x,k)
753 µs ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why is acorr_loop much faster? Thanks.

Edit: For comparison:

A = np.random.randn(9900,100)
v = np.random.randn(100)

%timeit np.dot(A,v)
1.08 ms ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.array([np.dot(a,v) for a in A])
12.4 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
RinaldoB

1 Answer

In the first case, you have a (100, 9900) dotted with a (9900,). In the second, you dot a (9900,) with a (9900,) 100 times.

There's a trade-off between memory-management costs in the first case and iteration costs in the second. Other SO questions have observed that a modest number of iterations on a large problem can be faster than one calculation on a much larger problem.

Why is B = numpy.dot(A,x) so much slower looping through doing B[i,:,:] = numpy.dot(A[i,:,:],x)?

There's another thing going on in your case - the time required to make that larger array:

In [366]: timeit np.array([x[i:-k+i] for i in range(k)])                             
2.62 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [367]: timeit np.dot(np.array([x[i:-k+i] for i in range(k)]),x[:-k])              
3.6 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [368]: %%timeit xx = np.array([x[i:-k+i] for i in range(k)]) 
     ...: np.dot(xx, x[:-k])                                                                  
1.05 ms ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The larger dot still takes more time than the 100 smaller ones, but constructing that xx is the bigger job.
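
As an aside (not part of the timings above): with NumPy >= 1.20 the shifted-rows matrix can be built as a strided view via np.lib.stride_tricks.sliding_window_view, which avoids the Python-level list construction entirely, though np.dot may still copy the non-contiguous view internally before handing it to BLAS. A minimal sketch:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def acorr_swv(x, k):
    # row i of this view is x[i : i + len(x)-k]; the first k rows match
    # [x[i:-k+i] for i in range(k)] without building a Python list
    windows = sliding_window_view(x, len(x) - k)
    return np.dot(windows[:k], x[:-k])

x = np.random.randn(10000)
k = 100
loop = np.array([np.dot(x[i:-k+i], x[:-k]) for i in range(k)])
assert np.allclose(acorr_swv(x, k), loop)
```

Whether this wins in practice depends on how the internal copy compares to the list-build cost, so it is worth timing on your own data.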

hpaulj
  • It's not as simple as that. See my edit for comparison. Actually, `acorr_loop` is even faster than a plain matrix multiplication, so there must be some optimization going on behind the scenes. Also, no array is actually constructed (no memory copied) in `acorr_aview`, that's just a list of array views (slices). – RinaldoB May 03 '19 at 17:36
  • That list of slices has to be turned into an array before dot can use it (pass it to BLAS). – hpaulj May 03 '19 at 17:58
  • Your comparison case is more loops on a smaller dot. – hpaulj May 03 '19 at 18:00
  • Thanks! In your example, any explanation why converting the slices to an array and multiplying is actually faster than doing either of those two by itself? – RinaldoB May 05 '19 at 18:08