On efficient solution is to accumulate lines of the input 2D array directly in the output array at a specific position and then perform the division.
The idea is to zero-initialize an output
array, then add [a, b, c]
to output[2:5]
, then add [d, e, f]
to output[1:4]
and then add [g, h, i]
to output[0:3]
. Finally, we can divide result
by [1, 2, 3, 2, 1]
.
Here is an implementation:
# Use the decorator @nb.njit here to use Numba
def compute(mat):
n = mat.shape[0]
output = np.zeros(n*2-1, dtype=np.float64)
for i in range(n-1, -1, -1):
output[i:i+n] += mat[n-1-i]
output[0:n] /= np.arange(1, n+1, 1, dtype=np.float64)
output[n:] /= np.arange(n-1, 0, -1, dtype=np.float64)
return output
Here is the resulting performance on my machine on a 1000x1000 array:
Initial code: 15.31 ms
Kevin's code: 3.26 ms ( x4.7)
This code: 1.42 ms (x10.8)
This code + Numba: 0.66 ms (x23.2)
Thus, this implementation is 10.8 times faster than the initial implementation and 2.3 times faster than the fastest alternative version. Note that using Numba results in an even faster code with almost not change to the code: this Numba version is 23.2 times faster than the initial implementation.