While NumPy with MKL optimizations seems to be quite fast for small matrix multiplications, it is quite slow for higher-dimensional tensor contractions. For example, using functions such as einsum is much slower than looping over the two-dimensional submatrices of a high-dimensional complex tensor:
import numpy as np

complexarray = np.random.rand(2867, 100, 40) + 1j*np.random.rand(2867, 100, 40)
complexarray_cc = complexarray.conj()
%timeit np.einsum('ijk, ijm -> jkm', complexarray_cc, complexarray)
1 loop, best of 3: 2.45 s per loop
%%timeit
tmp_product = np.zeros((100,40,40), dtype=complex)
for l in range(100):
    tmp_product[l,:,:] = np.dot(complexarray_cc[:,l,:].T, complexarray[:,l,:])
10 loops, best of 3: 120 ms per loop
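For what it's worth, the same contraction can apparently also be written as a single np.matmul call by moving the shared axis j to the front, so that NumPy sees a stack of ordinary 2D products. A minimal sketch, using the arrays defined above (I have not benchmarked this exhaustively):

# Move the shared axis j to the front: the arrays become a stack of 100
# independent (40, 2867) @ (2867, 40) products, giving a (100, 40, 40) result
matmul_product = np.matmul(complexarray_cc.transpose(1, 2, 0),
                           complexarray.transpose(1, 0, 2))

# Agrees with the explicit loop above up to floating-point noise:
# np.allclose(matmul_product, tmp_product) -> True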
Looking at
How to get faster code than numpy.dot for matrix multiplication?
and
Why is B = numpy.dot(A,x) so much slower looping through doing B[i,:,:] = numpy.dot(A[i,:,:],x) )?
which describe a similar problem, it seems that the arrays are just too large to fit into memory, and therefore slower matrix multiplication routines end up being used. Could this be the case here?
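Doing the back-of-the-envelope arithmetic (my own numbers, so take them with a grain of salt), the sizes at least seem consistent with that:

# complex128 takes 16 bytes per element
slice_bytes = 2867 * 40 * 16        # one 2D slice used by np.dot: ~1.8 MB
full_bytes  = 2867 * 100 * 40 * 16  # the whole tensor: ~183 MB

So each 2D slice in the loop fits comfortably in a typical CPU cache, while the full tensor clearly does not.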
Naively, my approach was to just replace every for loop in my code with vectorized numpy array operations. It seems this is not arbitrarily scalable as the arrays grow larger.
Is there any other option to increase the performance of np.dot and similar operations? Or, rephrased: up to which point is it recommended to loop over smaller subarrays rather than calling numpy functions on the full, very large arrays?
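In case it helps, this is the kind of harness I would use to look for the crossover point (the function names and the sweep values are just mine, not from any reference):

import timeit
import numpy as np

def einsum_version(a_cc, a):
    return np.einsum('ijk, ijm -> jkm', a_cc, a)

def loop_version(a_cc, a):
    out = np.zeros((a.shape[1], a.shape[2], a.shape[2]), dtype=complex)
    for l in range(a.shape[1]):
        out[l] = np.dot(a_cc[:, l, :].T, a[:, l, :])
    return out

for n in (100, 500, 1000, 2867):  # sweep the length of the contracted axis
    a = np.random.rand(n, 100, 40) + 1j * np.random.rand(n, 100, 40)
    a_cc = a.conj()
    t_einsum = timeit.timeit(lambda: einsum_version(a_cc, a), number=3) / 3
    t_loop = timeit.timeit(lambda: loop_version(a_cc, a), number=3) / 3
    print('n=%5d: einsum %.3f s, loop %.3f s' % (n, t_einsum, t_loop))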