Well one of the axes stays aligned in A
(first one) and B
(last one) and stays in output as well (last one) and is a very small looping number of 4
. So, we could simply loop over that one with with np.tensordot
for a tensor sum-reduction. The benefit of 4x
lesser memory congestion when working with such large datasets might overcome the 4x looping because the compute per iteration is also 4x
lesser.
Thus, a solution with tensordot
would be -
def func1(A, B):
out = np.empty(A.shape[1:3] + B.shape[1:])
for i in range(len(A)):
out[...,i] = np.tensordot(A[i], B[...,i],axes=(-1,0))
return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to re-iterate the importance of memory-congestion and compute requirement, let's say we want to sum-reduce the last axis of length 4
as well, then we will see a noticeable difference in timings for optimal
version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum
.
Specific to given problem
Given that dimensions of A
and B
stay the same, the array-initialization with out = np.empty(A.shape[1:3] + B.shape[1:])
could be done as a one-time affair and loop through each call of the log-likelihood function with the proposed looping over to use tensordot
and update output out
.