8

I have some nested loops (three total) where I'm trying to use numpy.einsum to speed up the calculations, but I'm struggling to get the notation correct. I managed to get rid of one loop, but I can't figure out the other two. Here's what I've got so far:

import numpy as np
import time

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for qi in range(nq):
            y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
    return y

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))

start = time.time()
y = myfunc(r, q, f)
end = time.time()

print(end-start)

While this is much faster than the original, it is still too slow, taking about 30 seconds. For reference, the original version without the einsum call was the following (which looks like it would take ~2.5 hours; I didn't wait to find out for sure):

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi)
    return y

Does anyone know how to get rid of these loops with an einsum, or any other tool for that matter?

tomerg

4 Answers

6

Your function seems to be equivalent to the following:

# broadcasting: r[...,None] is (nr, nr, 1) and q is (nq,), so s has shape (nr, nr, nq)
s = np.sinc(q * r[...,None]/np.pi)

np.einsum('iq,jq,ijq->q',f,f,s)

This took about 20 seconds on my system, with most of the time spent allocating s.

Let's test it for a small sample:

np.random.seed(1)
r = np.random.random(size=(10,10))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
s = np.sinc(q * r[...,None]/np.pi)
(np.abs(np.einsum('iq,jq,ijq->q',f,f,s) - myfunc(r,q,f)) < 1e-6).all()
# True

Since np.sinc is not a linear operator, I'm not quite sure how we can further reduce the run time.
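
For reference, a minimal sketch (not part of the original answer) of the same broadcast-plus-einsum idea evaluated block-wise over q, which bounds the size of the temporary sinc array at the cost of a short Python loop; the helper name myfunc_blocked is made up for illustration:

import numpy as np

def myfunc_blocked(r, q, f, block=100):
    # Same result as the einsum above, but builds the sinc array only
    # one (nr, nr, block)-sized slab at a time instead of (nr, nr, nq).
    nq = q.shape[0]
    y = np.empty(nq)
    for start in range(0, nq, block):
        sl = slice(start, start + block)
        s_blk = np.sinc(q[sl] * r[..., None] / np.pi)
        y[sl] = np.einsum('iq,jq,ijq->q', f[:, sl], f[:, sl], s_blk)
    return y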

Quang Hoang
  • thanks. just what I was looking for. This helps me understand einsum a bit better too. Maybe I'll see if I can make a lookup table for the np.sinc, or maybe try and bin the r values to make fewer of them at the expense of precision (which might be okay, I'll have to test it). – tomerg Oct 06 '20 at 15:04
  • The r array is actually a symmetric matrix. Is there a good way to use that to only perform half the calculations? – tomerg Oct 06 '20 at 15:05
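
Regarding the symmetry question in the comment above, one possible sketch (not from the thread, and assuming r is exactly symmetric): evaluate the sinc only on the diagonal and the strict upper triangle of r, then double the off-diagonal contribution.

nr = r.shape[0]
iu, ju = np.triu_indices(nr, k=1)                  # strict upper triangle of r
s_up = np.sinc(q * r[iu, ju][:, None] / np.pi)     # (n_pairs, nq), about half of s
s_diag = np.sinc(q * np.diag(r)[:, None] / np.pi)  # (nr, nq)
y = (np.einsum('iq,iq,iq->q', f, f, s_diag)
     + 2 * np.einsum('pq,pq->q', f[iu] * f[ju], s_up))
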
2

The sinc computation is the actual bottleneck, as also mentioned in @Quang Hoang's post. We will reuse the einsum expression from there and focus on speeding up the sinc part, like so -

Now, from the docs, numpy.sinc(x) is \sin(\pi x)/(\pi x), so np.sinc(x/np.pi) is simply \sin(x)/x. We will make use of this -

v = q*r[...,None]
p = np.sin(v)/v
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)
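
A quick sanity check (not part of the original answer) that the masked sin(v)/v really matches np.sinc on a small random sample:

rs = np.random.random(size=(50, 50))
qs = np.linspace(0, 1, 51)
with np.errstate(invalid='ignore', divide='ignore'):   # 0/0 positions are fixed by the mask
    vs = qs * rs[..., None]
    ps = np.sin(vs) / vs
ps[(qs == 0) | (rs == 0)[..., None]] = 1
print(np.allclose(ps, np.sinc(vs / np.pi)))            # expected: True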

Also, for large data, we can leverage multiple cores with numexpr, like so -

import numexpr as ne

p = ne.evaluate('sin(q*r3D)/(q*r3D)', {'r3D':r[...,None]})
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)

Timings with arrays of length 500 -

In [12]: r = np.random.random(size=(500,500))
    ...: q = np.linspace(0,1,501)
    ...: f = np.random.random(size=(r.shape[0],q.shape[0]))

# Original soln with einsum
In [15]: %%timeit
    ...: nr = r.shape[0]
    ...: nq = q.shape[0]
    ...: y = np.zeros(nq)
    ...: for ri in range(nr):
    ...:     for qi in range(nq):
    ...:         y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
9.75 s ± 977 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @Quang Hoang's soln
In [16]: %%timeit
    ...: s = np.sinc(q * r[...,None]/np.pi)
    ...: np.einsum('iq,jq,ijq->q',f,f,s)
2.75 s ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %%timeit
    ...: p = ne.evaluate('sin(q3D*r)/(q3D*r)', {'q3D':q[:,None,None]})
    ...: mask = (q==0)[:,None,None] | (r==0)
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,qij->q',f,f,p)
1.39 s ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %%timeit
    ...: v = q*r[...,None]
    ...: p = np.sin(v)/v
    ...: mask = (q==0) | (r==0)[...,None]
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,ijq->q',f,f,p)
2.11 s ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With larger data, we expect the numexpr version to perform better, as long as we don't run into out-of-memory cases.
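
For a rough sense of scale (not part of the original answer), with the problem size from the question the intermediate p array alone is already about 8 GB of float64:

nr, nq = 1000, 1001                # sizes from the question
print(nr * nr * nq * 8 / 1e9)      # ~8.0 GB for the (nr, nr, nq) float64 array p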

Divakar
2

The simplest way (and likely the most performant) is to use a JIT compiler, for example Numba. Since this function depends on the sinc function, also make sure that you have Intel SVML installed.
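
As a quick way to verify the SVML part (not from the original answer; the exact reporting can vary across Numba versions), the `numba -s` system report prints an "__SVML Information__" section, and recent versions also expose a config flag:

import numba
print(numba.__version__)
print(numba.config.USING_SVML)   # True if Numba detected Intel SVML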

Example

import numpy as np
import numba as nb

@nb.njit(fastmath=True,parallel=False,error_model="numpy",cache=True)
def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi)
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)

    #for contiguous memory access in the loop
    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                acc += f_T[qi,ri]*f_T[qi,rj]*np.sinc(q[qi]*r[ri,rj]/np.pi)
        y[qi]=acc
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt_2(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)


    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                #Test carefully!
                if q[qi]*r[ri,rj]!=0.:
                    acc += f_T[qi,ri]*f_T[qi,rj]*np.sin(q[qi]*r[ri,rj])/(q[qi]*r[ri,rj])
                else:
                    acc += f_T[qi,ri]*f_T[qi,rj]
        y[qi]=acc
    return y

def numpy_func(r, q, f):
    s = np.sinc(q * r[...,None]/np.pi)
    return np.einsum('iq,jq,ijq->q',f,f,s)

Timings with small arrays

r = np.random.random(size=(500,500))
q = np.linspace(0,1,501)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#765 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#158 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#51.5 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numpy_func(r, q, f)
#3.81 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(np.allclose(numpy_func(r, q, f),myfunc(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt_2(r, q, f)))

Timings with larger arrays

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#6.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#1.26 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#397 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
max9111
  • This did speed it up ~4x after I ran it once, presumably to fill some kind of cache I'm guessing? I might add this and the answer from @Divakar as options in case some users happen to have either Numba or numexpr available, then just default to numpy if not. How does this cache work in terms of running this many times for different r/f/q arrays, does it still speed it up if the r/f/q arrays are different, or does it require the r/f/q arrays to be the same for the cache to be useful? – tomerg Oct 07 '20 at 20:19
  • @tomerg Only a factor of 4 seems too low. Have you benchmarked exactly my implementation? (It is important to sum into a scalar `acc` in my example to get SVML working, and to use a recent Numba version -> there was a bug in the SVML implementation.) The caching is to cache the compiled function. It will work as long as you don't change the datatype of the arrays or pass non-contiguous arrays; in those cases a recompilation will occur, which is also added to the cache. Numba is much the same as Clang is for C: both translate to LLVM-IR, and the machine code is then generated by LLVM. – max9111 Oct 07 '20 at 20:32
  • I was using a smaller array for q for my testing. I just reran it with the original q with 1000 elements and indeed the speed-up is now more like 30x. – tomerg Oct 07 '20 at 20:46
  • Can you confirm your results with Numba are the same as einsum? I'm getting some weird results that are way off (like 10^57 off). I copy/pasted your function exactly and am trying to find out what I screwed up. – tomerg Oct 07 '20 at 21:07
  • I think you had the loops mixed up. I made the q loop the outer loop, then set `acc=0` before both the `ri` and `rj` loops. Now the numbers match up. – tomerg Oct 07 '20 at 21:37
  • @tomerg I have corrected the answer and added some optimizations. – max9111 Oct 07 '20 at 23:32
0

I'm writing this answer because I learnt a lot about einsum from @Quang Hoang's post, and it will strengthen my understanding to share how I thought through this problem.

The problem is to design a proper einsum operation for

y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))

  1. Look at the shapes of the relevant arrays. Inputs: r --> (a,a), q --> (c,), and f --> (a,c); Output: y --> (c,). From these shapes, for the part q[qi]*r[ri,:], a new array of shape (a,a,c) must be created by something like r[...,None]*q.

  2. Since np.sinc won't change an array's shape, and for a fixed pair (ri,qi) the factor f[ri,qi] is only a number, we should first think about which einsum operation reproduces np.einsum('i,i', f[:,qi], np.sinc(q[qi]*r[ri,:]/np.pi)) for all pairs (ri,qi) at once, i.e., how to obtain an array of shape (a,c) from two arrays of shapes (a,c) and (a,a,c). Intuitively, it is kl,ikl->il.

  3. Since step 2 gives an array of shape (a,c) and the shape of f is also (a,c), the final operation (restoring the f[ri,qi] factor and summing over ri) is simply il,il->l.

Based on the analysis above, we have the solution:

s = np.sinc(q*r[...,None]/np.pi)
res = np.einsum('kl,ikl->il',f,s)
res = np.einsum('il,il->l',f,res)
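
A quick check (not part of the original answer) on a small sample, assuming the myfunc from the question is in scope:

np.random.seed(0)
r = np.random.random(size=(10, 10))
q = np.linspace(0, 1, 11)
f = np.random.random(size=(r.shape[0], q.shape[0]))

s = np.sinc(q * r[..., None] / np.pi)
res = np.einsum('kl,ikl->il', f, s)
res = np.einsum('il,il->l', f, res)
print(np.allclose(res, myfunc(r, q, f)))   # expected: True
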
meTchaikovsky