If the 'dot' dimensions are at the end, matmul will work, since it batches over the leading dimensions.
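A minimal sketch of that batching rule (the shapes here are illustrative, not the ones from the example below): matmul treats the last two axes as matrices and broadcasts over everything in front of them.

```python
import numpy as np

# matmul multiplies the last two axes as matrices;
# the leading axis (3 here) is treated as a batch axis.
a = np.ones((3, 4, 5))
b = np.ones((3, 5, 6))
print((a @ b).shape)   # (3, 4, 6)
```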
Comparing 3 methods:
In [252]: A=np.arange(5*5*10*10*10).reshape((5,5,10,10,10))
In [253]: B=np.arange(5*10*10*10).reshape((5,10,10,10))
In [254]: C=np.sum(A*B, axis=1)
In [255]: D=np.einsum('ijklm,jklm->iklm',A,B)
In [256]: E = (A.transpose(2,3,4,0,1)@B.transpose(1,2,3,0)[...,None])[...,0].transpose(3,0,1,2)
All the transposes turn the arrays into (...,5,5)
and (...,5,1), so matmul does a batched matrix @ column-vector product.
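Checking those transposed shapes directly (same A and B as above):

```python
import numpy as np

A = np.arange(5*5*10*10*10).reshape((5, 5, 10, 10, 10))
B = np.arange(5*10*10*10).reshape((5, 10, 10, 10))

# After the transposes the 'dot' axes sit last, as matmul requires:
print(A.transpose(2, 3, 4, 0, 1).shape)          # (10, 10, 10, 5, 5)
print(B.transpose(1, 2, 3, 0)[..., None].shape)  # (10, 10, 10, 5, 1)
```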
In [257]: np.allclose(C,D)
Out[257]: True
In [258]: np.allclose(C,E)
Out[258]: True
In [259]: timeit C=np.sum(A*B, axis=1)
124 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [260]: timeit D=np.einsum('ijklm,jklm->iklm',A,B)
66 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [261]: timeit E = (A.transpose(2,3,4,0,1)@B.transpose(1,2,3,0)[...,None])[...,0].transpose(3,0,1,2)
68.6 µs ± 973 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
tensordot
reshapes and transposes the arrays so it can do a simple 2d dot
- and then converts the result back. Here the last 3 dimensions are effectively a single dimension of size 1000.
In [262]: np.tensordot(A,B,(1,0)).shape
Out[262]: (5, 10, 10, 10, 10, 10, 10)
In [263]: timeit np.tensordot(A,B,(1,0)).shape
74 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The result is much bigger - a kind of outer product on the non-summing dimensions. The values we want are in there, buried as a multidimensional diagonal.
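That diagonal can be pulled out with einsum by repeating the k,l,m labels in one operand (a sketch reusing the same A, B and C as above):

```python
import numpy as np

A = np.arange(5*5*10*10*10).reshape((5, 5, 10, 10, 10))
B = np.arange(5*10*10*10).reshape((5, 10, 10, 10))
C = np.sum(A*B, axis=1)

T = np.tensordot(A, B, (1, 0))        # shape (5,10,10,10,10,10,10)
# Repeated labels within one operand extract the diagonal:
D = np.einsum('iklmklm->iklm', T)
print(np.allclose(D, C))              # True
```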
tensordot is effectively:
In [264]: X=(B.reshape(5,1000).T).dot(A.reshape(5,5,1000))
In [265]: X.shape
Out[265]: (1000, 5, 1000)
In [266]: timeit X=(B.reshape(5,1000).T).dot(A.reshape(5,5,1000))
78.9 ms ± 82.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
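The same diagonal extraction recovers the wanted values from X: the p index repeats along the first and last axes of the (1000, 5, 1000) result (a sketch reusing the arrays above):

```python
import numpy as np

A = np.arange(5*5*10*10*10).reshape((5, 5, 10, 10, 10))
B = np.arange(5*10*10*10).reshape((5, 10, 10, 10))
C = np.sum(A*B, axis=1)

X = (B.reshape(5, 1000).T).dot(A.reshape(5, 5, 1000))  # (1000, 5, 1000)
# C.reshape(5,1000)[i,p] lives at X[p,i,p]; repeat p to extract it:
Y = np.einsum('pip->ip', X).reshape(5, 10, 10, 10)
print(np.allclose(Y, C))                               # True
```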