
I have a likelihood function that I am trying to sample with MCMC. I use no for loops in the log-likelihood itself, but I do call np.einsum() once.

Here's a sample of what my current code looks like:

import numpy as np

A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4)  # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")

The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine); is there any way to speed it up? One useful piece of information is that for each call of the log-likelihood function, the actual values in the arrays A and B change, but the dimensions of each array remain fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.

Jsn

2 Answers


Even when used in a small loop, tensordot is more than 10x faster:

from timeit import timeit

timeit(lambda: np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"), number=5)/5
# 3.052245747600682
timeit(lambda: np.stack([np.tensordot(a, b, 1) for a, b in zip(A, B.transpose(2,0,1))], -1), number=10)/10
# 0.23842503569903784

out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True
Paul Panzer

Well, one of the axes stays aligned between A (its first axis) and B (its last axis), stays in the output as well (its last axis), and is very small in length (4). So, we could simply loop over that axis and use np.tensordot for a tensor sum-reduction. The benefit of 4x less memory congestion when working with such large arrays might outweigh the 4x looping, because the compute per iteration is also 4x smaller.

Thus, a solution with tensordot would be -

def func1(A, B):
    # output shape: A.shape[1:3] + B.shape[1:] = (50, 60, 1000, 4)
    out = np.empty(A.shape[1:3] + B.shape[1:])
    for i in range(len(A)):
        # contract the last axis of A[i] with the first axis of B[..., i]
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    return out
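
A quick sanity check that func1 matches the einsum result on the same inputs:

out_td = func1(A, B)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td, out_es)
# True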

Timings -

In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
    ...: B = np.random.rand(200,1000,4)  # Random NDarray
    ...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")

# Einsum solution without optimize    
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Einsum solution with optimize    
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    

# @Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Just to reiterate the importance of memory congestion and compute requirements, let's say we also want to sum-reduce the last axis of length 4; then we see a noticeable difference in timings for the optimal version -

In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, in that case, it would be better to go with einsum.
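
For reference, the fully reduced output is simply the previous one summed over its last (length-4) axis, so the two calls can be cross-checked:

out_full = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
out_red = np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
np.allclose(out_red, out_full.sum(-1))
# True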

Specific to the given problem

Given that the dimensions of A and B stay the same across calls, the array initialization out = np.empty(A.shape[1:3] + B.shape[1:]) could be done once as a one-time affair, and each call of the log-likelihood function could then run the proposed tensordot loop to update the output out in place.
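
A minimal sketch of that pattern (log_likelihood here is a hypothetical placeholder for the actual function; the point is only the one-time allocation and the in-place update):

out = np.empty(A.shape[1:3] + B.shape[1:])  # allocate once; shapes stay fixed across calls

def log_likelihood(A, B):
    # only the values in A and B change between calls, so reuse the same buffer
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    # ... compute and return the actual log-likelihood from out here ...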

Divakar
  • @Jsn If you are new to `tensordot`, this could be helpful - https://stackoverflow.com/questions/41870228/ for some examples with it. – Divakar Jul 14 '20 at 19:12
  • That memory-congestion insight is instructive. I wasn't aware that the effects are so drastic. – Paul Panzer Jul 14 '20 at 21:10
  • The memory-congestion example is more of an einsum vs BLAS issue. In [78] will reorganize the tensors and perform a TDOT; In [79] can't be expressed as a TDOT, so it falls back to einsum loops. – Daniel Jul 15 '20 at 20:02