By using numpy and some math it is possible to speed up your code, so it becomes an order of magnitude faster than the current numba version. We will also see that using numba on the improved function makes it even faster.
Quite often numba is overused - in many cases it is possible to write numpy-only code which is quite efficient, and this is also the case here.
A problem with the numpy code at hand: one should not access single elements but leverage numpy's built-in functions - most of the time they are as fast as it gets. Only if it is impossible to use those numpy functions should one fall back to numba or cython.
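To illustrate the point, here is a minimal sketch (the function names are mine, chosen just for illustration) contrasting element-wise access with a single vectorized call:

import numpy as np

a = np.random.random(10**6)

def exp_loop(arr):
    # one Python-level call per element - pays interpreter overhead every time
    out = np.empty_like(arr)
    for k in range(arr.shape[0]):
        out[k] = np.exp(arr[k])
    return out

def exp_vec(arr):
    # one call - the whole loop runs inside numpy's compiled code
    return np.exp(arr)

Both return the same values, but the vectorized call avoids the per-element interpreter and dispatch overhead, which typically makes a difference of one to two orders of magnitude.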
However, the biggest issue here is the formulation of the problem. For fixed i and j we have the following formula to calculate (I simplified it a little bit):

g[i,j] = sum_ii sum_jj exp(value_ii + value_jj)
       = sum_ii sum_jj exp(value_ii) * exp(value_jj)
       = (sum_ii exp(value_ii)) * (sum_jj exp(value_jj))

To evaluate the last formula we need O(n+m) operations, but for the first, naive formula O(n*m) - quite a difference!
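The last step is just the factorization of a double sum of products into a product of two sums. A quick numerical sanity check (the array names here are made up for the demonstration):

import numpy as np

vals_ii = np.random.random(5)
vals_jj = np.random.random(7)

# naive double sum: O(n*m) exp-evaluations
naive = sum(np.exp(v_ii + v_jj) for v_ii in vals_ii for v_jj in vals_jj)
# factorized form: O(n+m) exp-evaluations
factorized = np.exp(vals_ii).sum() * np.exp(vals_jj).sum()

assert np.isclose(naive, factorized)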
A first version leveraging numpy functionality could look similar to this:

def calc_ead(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n, m))
    vI = np.arange(n)
    vJ = np.arange(m)
    for i in range(n):
        for j in range(m):
            II = (i-vI)*dx                 # all row distances at once
            JJ = (j-vJ)*dx                 # all column distances at once
            denom = 2.0*(s0*(1.0+e*b[i, j]))**2
            expII = np.exp(-II*II/denom)   # vectorized over ii
            expJJ = np.exp(-JJ*JJ/denom)   # vectorized over jj
            gb[i, j] = norm*(expII.sum()*expJJ.sum())  # product of the two sums
    return gb
And now, compared to the original numba implementation:
>>> a=np.random.random((256,256))
>>> print(calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)
1min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
and now the numpy function:
>>> print(calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 calc_ead(a,0.1,1.0,0.5)
1.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
There are two observations:
- the results are the same.
- the numpy version is about 37 times faster; for bigger problems this difference will become even greater.
Clearly, you could leverage numba for an even bigger speed-up. However, it is still a good idea to use numpy functionality where possible - it is quite surprising how subtle even the simplest things can be, for example calculating a sum:
>>> nb_calc_ead = njit(double[:, :](double[:, :],double,double,double))(calc_ead)
>>> print(nb_calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 nb_calc_ead(a,0.1,1.0,0.5)
587 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
That's another factor of 3!
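As an aside: the explicit signature above is not required - one could also let numba infer the types lazily at the first call. A sketch (the behavior should be the same, only the compilation is deferred to the first call):

from numba import njit

nb_calc_ead_lazy = njit(calc_ead)   # no signature: types inferred lazily
nb_calc_ead_lazy(a, 0.1, 1.0, 0.5)  # first call triggers the compilation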
This problem can be parallelized, but it is not trivial to do right. My cheap attempt using explicit loop parallelization:
import numpy as np
from numba import njit, prange
import math

@njit(parallel=True)   # needed, so the prange loop is parallelized
def parallel_nb_calc_ead(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n, m))
    vI = np.arange(n)
    vJ = np.arange(m)
    for i in prange(n):   # outer loop = explicit prange loop
        for j in range(m):
            denom = 2.0*(s0*(1.0+e*b[i, j]))**2
            expII = np.zeros((n,))
            expJJ = np.zeros((m,))
            for k in range(n):
                II = (i-vI[k])*dx
                expII[k] = math.exp(-II*II/denom)
            for k in range(m):
                JJ = (j-vJ[k])*dx
                expJJ[k] = math.exp(-JJ*JJ/denom)
            gb[i, j] = norm*(expII.sum()*expJJ.sum())
    return gb
And now:
>>> print(parallel_nb_calc_ead(a,0.1,1.0,0.5)[1,1])
15.9160709993
>>> %timeit -n1 -r1 parallel_nb_calc_ead(a,0.1,1.0,0.5)
349 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
That means almost another factor of 2 (my machine has only two CPUs; depending on the hardware, the speed-up could be bigger). By the way, we are now almost 200 times faster than the original version.
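If you want to experiment with the number of threads numba uses for prange, recent numba versions expose set_num_threads/get_num_threads (a small sketch, assuming numba >= 0.49):

from numba import get_num_threads, set_num_threads

print(get_num_threads())  # how many threads parallel regions currently use
set_num_threads(2)        # e.g. restrict parallel regions to 2 threads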
I bet one could improve the above code, but I'm not going there.
Listing of the original version against which calc_ead is compared:
import numpy as np
from numba import njit, double

def calc_gb_gauss_2d(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for ii in range(n):
                for jj in range(m):
                    gb[i, j] += np.exp(-(((i-ii)*dx)**2+((j-jj)*dx)**2)/(2.0*(s0*(1.0+e*b[i, j]))**2))
            gb[i, j] *= norm
    return gb

calc_gb_gauss_2d_nb = njit(double[:, :](double[:, :], double, double, double))(calc_gb_gauss_2d)
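As a final sanity check, one can compare the full result arrays instead of a single element (a quick sketch):

a = np.random.random((256, 256))
res_nb = calc_gb_gauss_2d_nb(a, 0.1, 1.0, 0.5)
res_np = calc_ead(a, 0.1, 1.0, 0.5)
assert np.allclose(res_nb, res_np)  # all elements agree within tolerance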