This is a contrived test case, but hopefully it suffices to convey the point and ask the question. Inside a Numba njit function, I noticed that assigning a locally computed value to an array element is very costly. Here are two example functions:
from numba import njit
import numpy as np


@njit
def slow_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            x[i] = result
        else:
            x[i] = result


@njit
def fast_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            z = result
        else:
            z = result


if __name__ == "__main__":
    x = np.random.rand(100_000_000)
    y = np.random.rand(100_000_000)
    %timeit slow_func(x, y)  # 177 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit fast_func(x, y)  # 407 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
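Since %timeit is an IPython magic and will not run in a plain Python script, here is a rough sketch of how the same comparison could be run with the standard timeit module. The helper name best_time, the much smaller array size, and the fallback that replaces njit with a no-op when Numba is not installed are all my own assumptions for illustration, so the absolute numbers will not match the ones above.

```python
import timeit

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, run as plain Python.
    def njit(func):
        return func


@njit
def slow_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            x[i] = result
        else:
            x[i] = result


@njit
def fast_func(x, y):
    result = y.sum()
    for i in range(x.shape[0]):
        if x[i] > result:
            z = result
        else:
            z = result


def best_time(func, x, y, repeat=3, number=1):
    """Best wall-clock time in seconds for one call of func(x, y).

    Hypothetical helper, not part of Numba or timeit.
    """
    func(x, y)  # warm-up call so that JIT compilation is not timed
    timings = timeit.repeat(lambda: func(x, y), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    n = 10_000  # far smaller than the 100_000_000 used above
    x = np.random.rand(n)
    y = np.random.rand(n)
    print("slow_func:", best_time(slow_func, x, y))
    print("fast_func:", best_time(fast_func, x, y))
```

The warm-up call inside best_time plays the same role as re-running %timeit after the first compilation.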
I understand that the two functions aren't doing quite the same thing, but let's set that aside for now and stay focused on the "slow assignment". Also, because Numba compiles lazily on the first call, the timings above were re-run after JIT compilation. Notice that both functions assign result to either x[i] or z, and the number of assignments is the same in both cases. However, assigning result to z is substantially faster. Is there a way to make slow_func as fast as fast_func?
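As a rough sanity check on the two timings, the arithmetic below converts them into elements handled per second, using only the array size and the measurements quoted above. The variable and function names are mine. The fast_func figure comes out far higher than a loop that really visits 100,000,000 elements could plausibly reach, so I suspect that loop is not actually being executed, although I have not verified this.

```python
N = 100_000_000         # elements in x (and in y), from the script above
SLOW_S = 177e-3         # measured time for slow_func, in seconds
FAST_S = 407e-9         # measured time for fast_func, in seconds
BYTES_PER_FLOAT64 = 8   # np.random.rand returns float64 values


def elements_per_second(n, seconds):
    """Throughput implied by processing n elements in the given time."""
    return n / seconds


slow_rate = elements_per_second(N, SLOW_S)
fast_rate = elements_per_second(N, FAST_S)

# Lower bound on the data slow_func must read: each element of x and y once.
slow_gb_per_s = 2 * N * BYTES_PER_FLOAT64 / SLOW_S / 1e9

print(f"slow_func: {slow_rate:.2e} elements/s, at least {slow_gb_per_s:.1f} GB/s read")
print(f"fast_func: {fast_rate:.2e} elements/s")
```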