
I'm a bit new to working with Numba, but I've got the gist of it. I wonder if there are any more advanced tricks to make four nested for loops even faster than what I have now. In particular, I need to calculate the following integral:

(image: the integral to be computed; its discretized form is the quadruple sum implemented in the code below)

Where B is a 2D array, and S0 and E are certain parameters. My code is the following:

import numpy as np
from numba import njit, double

def calc_gb_gauss_2d(b,s0,e,dx):
    n,m=b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n,m))
    for i in range(n):
        for j in range(m):
            for ii in range(n):
                for jj in range(m):
                    gb[i,j]+=np.exp(-(((i-ii)*dx)**2+((j-jj)*dx)**2)/(2.0*(s0*(1.0+e*b[i,j]))**2))
            gb[i,j]*=norm
    return gb

calc_gb_gauss_2d_nb = njit(double[:, :](double[:, :],double,double,double))(calc_gb_gauss_2d)

For an input array of size 256x256 the calculation speed is:

In [4]: a=np.random.random((256,256))

In [5]: %timeit calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)
The slowest run took 8.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 1min 1s per loop

A comparison between the pure Python and Numba calculation speeds gives me this picture: (image: plot comparing pure Python and Numba run times)

Is there any way to optimize my code for better performance?

Ohm
  • Compute `(2.0*(s0*(1.0+e*b[i,j]))**2)` in the `j` loop, instead of the innermost loop. – Moses Koledoye May 09 '18 at 11:21
  • Also your question would be more suitable for [code review](https://codereview.stackexchange.com/) since your code works and you're looking for ways to improve. – Moses Koledoye May 09 '18 at 11:22
  • Thanks a whole bunch... so should I remove this question from here and move it to code review? – Ohm May 09 '18 at 15:21
  • Yes, that would be fine. – Phrancis May 09 '18 at 15:22
  • I would say, take a look at how many questions there are for numba on CodeReview. I think you have a better chance here... – ead May 09 '18 at 15:25
  • 1) Don't use explicit type declarations. (You can't explicitly declare that the input arrays are contiguous in memory, which is necessary for SIMD vectorization.) Take a look at https://numba.pydata.org/numba-doc/dev/user/performance-tips.html (the fastmath=True keyword and using Intel SVML can make quite a difference in performance). Also use the latest Numba version; there have been some optimizations regarding the performance of parallelized functions recently. – max9111 May 14 '18 at 07:55
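
A minimal sketch of what these comments suggest - no explicit type signature (so Numba can specialize on the contiguous input arrays), fastmath=True enabled, and the loop-invariant denominator hoisted out of the two inner loops. The name calc_gb_gauss_2d_hoisted is only for illustration:

import numpy as np
from numba import njit

@njit(fastmath=True)   # let Numba infer the types; fastmath relaxes strict IEEE rules
def calc_gb_gauss_2d_hoisted(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # depends only on i and j, so compute it once per (i, j)
            denom = 2.0*(s0*(1.0 + e*b[i, j]))**2
            for ii in range(n):
                for jj in range(m):
                    gb[i, j] += np.exp(-(((i-ii)*dx)**2 + ((j-jj)*dx)**2)/denom)
            gb[i, j] *= norm
    return gb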

1 Answer


By using numpy and some maths it is possible to speed up your code so that it becomes faster than the current numba version by an order of magnitude. We will also see that using numba on the improved function makes it even faster.

Quite often numba is overused - it is frequently possible to write numpy-only code that is quite efficient, and this is also the case here.

A problem with the numpy code at hand: one should not access single elements but leverage numpy's built-in functions - most of the time they are as fast as it gets. Only when it is impossible to use those numpy functions would one reach for numba or cython.

However, the biggest issue here is the formulation of the problem. For fixed i and j we have to calculate the following (I simplified it a little bit):

 g[i,j]=sum_ii sum_jj exp(value_ii+value_jj)
       =sum_ii sum_jj exp(value_ii)*exp(value_jj)
       =sum_ii exp(value_ii) * sum_jj exp(value_jj)

To evaluate the last formula we need O(n+m) operations, while the first, naive formula needs O(n*m) - quite a difference!
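
A quick numerical check of that factorization (a standalone sketch with small made-up vectors, not part of the original post):

import numpy as np

value_ii = np.random.random(4)   # stand-ins for the ii-dependent exponents
value_jj = np.random.random(5)   # stand-ins for the jj-dependent exponents

naive    = sum(np.exp(vi + vj) for vi in value_ii for vj in value_jj)
factored = np.exp(value_ii).sum() * np.exp(value_jj).sum()

assert np.isclose(naive, factored)   # both ways give the same sum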

A first version leveraging numpy-functionality could be similar to:

def calc_ead(b,s0,e,dx):
    n,m=b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n,m))
    vI=np.arange(n)
    vJ=np.arange(m)
    for i in range(n):
        for j in range(m):
            II=(i-vI)*dx
            JJ=(j-vJ)*dx
            denom=2.0*(s0*(1.0+e*b[i,j]))**2
            expII=np.exp(-II*II/denom)
            expJJ=np.exp(-JJ*JJ/denom)
            gb[i,j]=norm*(expII.sum()*expJJ.sum())
    return gb

And now, compared to the original numba-implementation:

>>> a=np.random.random((256,256))

>>> print(calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)
1min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

and now the numpy function:

>>> print(calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 calc_ead(a,0.1,1.0,0.5)
1.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

There are two observations:

  1. the results are the same.
  2. the numpy version is 37 times faster; for bigger problems this difference will become even greater.

Clearly, you could leverage numba for an even bigger speed-up. However, it is still a good idea to use numpy functionality when possible - it is quite surprising how subtle even the simplest things, for example calculating a sum, can be:

>>> nb_calc_ead = njit(double[:, :](double[:, :],double,double,double))(calc_ead)
>>> print(nb_calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 nb_calc_ead(a,0.1,1.0,0.5)
587 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

That is another factor of 3!

This problem could be parallelized, but it is not trivial to do right. My cheap attempt using explicit loop parallelization:

from numba import njit, prange
import math

@njit(parallel=True)                 # parallel=True is needed so prange actually runs in parallel
def parallel_nb_calc_ead(b,s0,e,dx):
    n,m=b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n,m))
    vI=np.arange(n)
    vJ=np.arange(m)
    for i in prange(n):             #outer loop = explicit prange-loop
        for j in range(m):
            denom=2.0*(s0*(1.0+e*b[i,j]))**2
            expII=np.zeros((n,))
            expJJ=np.zeros((m,))
            for k in range(n):
                II=(i-vI[k])*dx
                expII[k]=math.exp(-II*II/denom)

            for k in range(m):
                JJ=(j-vJ[k])*dx
                expJJ[k]=math.exp(-JJ*JJ/denom)
            gb[i,j]=norm*(expII.sum()*expJJ.sum())
    return gb

And now:

>>> print(parallel_nb_calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 parallel_nb_calc_ead(a,0.1,1.0,0.5)
349 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

This means almost another factor of 2 (my machine has only two CPUs; depending on the hardware the speed-up could be bigger). By the way, we are now almost 200 times faster than the original version.

I bet one could improve the above code, but I'm not going there.
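
A small sanity check one could add to confirm that all versions agree (a sketch, assuming the functions above are defined in the same session; a modest array size keeps the O(n*m) reference version affordable):

a = np.random.random((64, 64))
ref = calc_gb_gauss_2d_nb(a, 0.1, 1.0, 0.5)      # original quadruple-loop version as reference

assert np.allclose(ref, calc_ead(a, 0.1, 1.0, 0.5))
assert np.allclose(ref, nb_calc_ead(a, 0.1, 1.0, 0.5))
assert np.allclose(ref, parallel_nb_calc_ead(a, 0.1, 1.0, 0.5))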


Listing of the current version against which calc_ead is compared:

import numpy as np
from numba import njit, double

def calc_gb_gauss_2d(b,s0,e,dx):
    n,m=b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n,m))
    for i in range(n):
        for j in range(m):
            for ii in range(n):
                for jj in range(m):
                    gb[i,j]+=np.exp(-(((i-ii)*dx)**2+((j-jj)*dx)**2)/(2.0*(s0*(1.0+e*b[i,j]))**2))
            gb[i,j]*=norm
    return gb

calc_gb_gauss_2d_nb = njit(double[:, :](double[:, :],double,double,double))(calc_gb_gauss_2d)
ead
  • This is of great help, thank you. Especially insightful is your comment on how to reduce to `O(n+m)` operations; I wasn't aware of this kind of consideration. As someone suggested in the comments, I've transferred this question to code review. I wonder how we can transfer your answer there (and I guess delete this post in order not to have two posts on the same subject) - https://codereview.stackexchange.com/q/194023/105140 – Ohm May 10 '18 at 07:05
  • @Ohm I still think that this question is OK for SO: it is very specific, and just because you did some work doesn't mean you should post it on code review. Let's face it, the users answering numba questions here are not very active on code review. – ead May 10 '18 at 07:18
  • On which Numba version and processor have you tested your code? On my machine (Core i7 4771, Numba 0.39 dev.) your parallelized version fails the np.allclose() test against the original implementation, but your single-threaded version passes all tests (the deviation is up to 4% on some values). The relative speeds also deviate significantly from my observations (the original version takes 205s, your single-threaded version takes 250ms, the parallelized version takes 60ms). – max9111 May 25 '18 at 13:17
  • @max9111 thanks for looking into it. I will not have access to the machine for some time and don't know the numba version by heart. But I will look into it as soon as I get my hands on it again. It is somewhat surprising that the parallel function yields different results, because as far as I can see there are no race conditions. – ead May 25 '18 at 13:50
  • I looked into it a bit. The parallelized version fails the test even if prange and parallel=True are removed. Both of your implementations look quite similar, but give different results. Maybe some compiler problem.... – max9111 May 25 '18 at 14:07
  • @max9111 the problem is probably the typo `JJ=i-vJ[k]*dx`: it should be `j` and not `i`. This should be a lesson to me to use `np.allclose` and not something lazy and less waterproof. – ead May 25 '18 at 14:14