With `np.einsum('ikl,kjl->ijl', A, B)`, the string label `l` imposes an axis-alignment requirement: `l` stays aligned across both inputs and the output. As such, using `np.tensordot` might not necessarily result in a performance improvement, but since the question specifically asked for it, let's suggest it anyway. `np.tensordot` would spread out the axes that don't take part in sum-reduction as separate axes, so the single aligned `l` axis becomes an `(N,N)` pair of axes in the result. To get to the final output, we need to extract the elements along the diagonal of those spread-out axes.
Here's how a solution with `np.tensordot` would look like -

    N = A.shape[-1]  # length of the aligned axis l
    mask = np.eye(N, dtype=bool)
    # tensordot over k gives (i,l,j,l); swap to (i,j,l,l), then pick the diagonal
    out = np.tensordot(A, B, axes=((1),(0))).swapaxes(1,2)[:,:,mask]
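As a quick sanity check, here's a minimal sketch verifying that the tensordot-plus-diagonal-mask route matches `einsum`. The shapes are made up for illustration, with the sum-reduction axis `k` set to length `4` as in the question:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 3, 4, 5  # hypothetical sizes; K is the sum-reduction length
A = rng.standard_normal((M, K, N))
B = rng.standard_normal((K, M, N))

# Reference: sum over k, with l kept aligned across inputs and output
ref = np.einsum('ikl,kjl->ijl', A, B)

# tensordot over k gives shape (M, N, M, N); swap to (M, M, N, N),
# then extract the diagonal of the two spread-out l axes
mask = np.eye(N, dtype=bool)
out = np.tensordot(A, B, axes=((1), (0))).swapaxes(1, 2)[:, :, mask]

print(np.allclose(ref, out))  # → True
```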
There would be cases where `np.dot`/`np.tensordot` might come out as the winner, but that requires the sum-reduction axes to have decent lengths. See this post for a more detailed analysis.
For the given problem, that's not the case, as the length of sum-reduction is just `4`. So, I would think `einsum` would be best here!