Numpy-Blas calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot-products of small 3x3 matrices or similar cases, numpy-MKL which I use can often be outperformed by quite a margin.
This external Blas routines are usually made for problems with larger matrices, for smaller ones you can write out a standard algorithm or take a look at eg. Intel IPP.
Please keep also in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
@nb.njit(fastmath=True)
def inversion(m):
minv=np.empty(m.shape,dtype=m.dtype)
for i in range(m.shape[0]):
determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
return minv
#I was to lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
m_TMP=m.reshape(m.shape[0],9)
minv=inversion(m_TMP)
return minv.reshape(minv.shape[0],3,3)
#Testing
A = np.random.rand(1000000,3,3)
#Warmup to not measure compilation overhead on the first call
#You may also use @nb.njit(fastmath=True,cache=True) but this has also about 0.2s
#overhead on fist call
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s