While NumPy with MKL optimizations seems to be quite fast for small matrix multiplications, it is quite slow for higher-dimensional tensor contractions. For example, using functions such as einsum is much slower than looping over the two-dimensional submatrices of a high-dimensional complex tensor:
import numpy as np

complexarray = np.random.rand(2867, 100, 40) + 1j*np.random.rand(2867, 100, 40)
complexarray_cc = complexarray.conj()
%timeit np.einsum('ijk, ijm -> jkm', complexarray_cc, complexarray)
1 loop, best of 3: 2.45 s per loop
%%timeit
tmp_product = np.zeros((100,40,40), dtype=complex)
for l in range(100):
    tmp_product[l,:,:] = np.dot(complexarray_cc[:,l,:].T, complexarray[:,l,:])
10 loops, best of 3: 120 ms per loop
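For what it's worth, the same contraction can apparently also be written as a single np.matmul call by moving the shared axis j to the front, so that NumPy sees a stack of ordinary 2D products. A minimal sketch, using the arrays defined above (I have not benchmarked this exhaustively):

# Move the shared axis j to the front: the arrays become a stack of 100
# independent (40, 2867) @ (2867, 40) products, giving a (100, 40, 40) result
matmul_product = np.matmul(complexarray_cc.transpose(1, 2, 0),
                           complexarray.transpose(1, 0, 2))

# Agrees with the explicit loop above up to floating-point noise:
# np.allclose(matmul_product, tmp_product) -> True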
Looking at
How to get faster code than numpy.dot for matrix multiplication?
and
Why is B = numpy.dot(A,x) so much slower looping through doing B[i,:,:] = numpy.dot(A[i,:,:],x) )?
which describe a similar problem, it seems that the arrays are just too large to fit into memory, and therefore slower matrix multiplication routines end up being used. Could this be the case here?
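Doing the back-of-the-envelope arithmetic (my own numbers, so take them with a grain of salt), the sizes at least seem consistent with that:

# complex128 takes 16 bytes per element
slice_bytes = 2867 * 40 * 16        # one 2D slice used by np.dot: ~1.8 MB
full_bytes  = 2867 * 100 * 40 * 16  # the whole tensor: ~183 MB

So each 2D slice in the loop fits comfortably in a typical CPU cache, while the full tensor clearly does not.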
Naively, my approach was to just replace every for loop in my code with vectorized numpy array operations. It seems this is not arbitrarily scalable as the arrays grow larger.
Is there any other option to increase the performance of np.dot and similar operations? Or, rephrased: up to which point is it recommended to loop over smaller subarrays rather than calling numpy functions on the full, very large arrays?
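In case it helps, this is the kind of harness I would use to look for the crossover point (the function names and the sweep values are just mine, not from any reference):

import timeit
import numpy as np

def einsum_version(a_cc, a):
    return np.einsum('ijk, ijm -> jkm', a_cc, a)

def loop_version(a_cc, a):
    out = np.zeros((a.shape[1], a.shape[2], a.shape[2]), dtype=complex)
    for l in range(a.shape[1]):
        out[l] = np.dot(a_cc[:, l, :].T, a[:, l, :])
    return out

for n in (100, 500, 1000, 2867):  # sweep the length of the contracted axis
    a = np.random.rand(n, 100, 40) + 1j * np.random.rand(n, 100, 40)
    a_cc = a.conj()
    t_einsum = timeit.timeit(lambda: einsum_version(a_cc, a), number=3) / 3
    t_loop = timeit.timeit(lambda: loop_version(a_cc, a), number=3) / 3
    print('n=%5d: einsum %.3f s, loop %.3f s' % (n, t_einsum, t_loop))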