method2
is the basic way to do that in Numpy. That being said, it is not very well optimized internally for such a case yet. Indeed, the reduction is done along a very small number of items, while the internal reduction code is optimized for a relatively large number of items. AFAIR, compilers like GCC tend to auto-vectorize the code using SIMD instructions, resulting in much slower execution for small reductions. It might be optimized in the future, but this is tricky to do since the problem is mainly related to the way compilers optimize the code and the assumptions they make during the optimization steps. Thus, it is not really a problem of Numpy, though there are ways to specifically optimize this use-case at the expense of less-maintainable code.
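For reference, here is a minimal sketch of what such a basic NumPy reduction typically looks like for summing groups of 4 consecutive items (the exact method2 of the question may differ, and the input size is just an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Sum each group of 4 consecutive items with a plain axis reduction
res = arr.reshape(-1, 4).sum(axis=1)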
method3
is not very efficient since np.add.reduceat
is currently not well optimized internally in Numpy yet. We plan to improve this, but one should not expect a drastic improvement since the method is fundamentally not very efficient on modern CPUs anyway.
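For reference, a sketch of such a reduceat-based approach (the exact method3 of the question may differ; the input size is an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Reduce over the segments starting at indices 0, 4, 8, ...
res = np.add.reduceat(arr, np.arange(0, arr.size, 4))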
method1
is clever because it makes use of BLAS, which is very optimized internally. The default implementation on most platforms, OpenBLAS, carefully optimizes many use-cases, including small matrix/vector multiplications, resulting in much faster execution. That being said, it is not optimal due to the unneeded multiplications by ones (BLAS does not optimize the computations based on the content of the values).
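For reference, a sketch of such a BLAS-based approach using a matrix-vector product with a vector of ones (the exact method1 of the question may differ; the input size is an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Each row of the reshaped array is multiplied by 1 and summed,
# which dispatches to an optimized BLAS routine (e.g. OpenBLAS)
res = arr.reshape(-1, 4) @ np.ones(4)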
AFAIK, there is no way to write a faster implementation than method1
in pure Numpy. As a result, the only option left to speed up the code is to execute natively-compiled code specifically designed for your use-case. This is possible using Numba or Cython. Here is a naive implementation:
import numba as nb
import numpy as np

@nb.njit('(float64[::1],)')
def method4(arr):
    n = arr.size // 4  # number of groups of 4 consecutive items
    res = np.empty(n)
    for i in range(n):
        # Sum each group of 4 consecutive items
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
If you run this code, you will certainly get performance similar to BLAS, demonstrating how good BLAS implementations are (in fact, OpenBLAS is a bit faster on my machine). This code is not optimal because it is mainly memory-bound and page faults slow things down on most systems (see this related post). You can mitigate their overhead using multiple threads. This is still not optimal, as page faults do not scale well on all platforms (quite fine on Linux, but poor on Windows). Alternatively, you can preallocate the output array once so you pay this overhead only once. You can even mix both approaches depending on your needs (using multiple threads can be useful to ensure the memory bandwidth is saturated whatever the target platform, though creating threads can be expensive). Here are the naive parallel implementation and an optimized parallel implementation:
# Naive parallel implementation mitigating a bit the page-faults overhead
@nb.njit('(float64[::1],)', parallel=True)
def method5(arr):
    n = arr.size // 4
    res = np.empty(n)
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
# Parallel implementation avoiding page faults completely
# (assuming `res` is preallocated and filled)
@nb.njit('(float64[::1],float64[::1])', parallel=True)
def method6(arr, res):
    n = res.size
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
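For example, here is a possible way to use method6 with a preallocated output (the input size is just an assumption for illustration):

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption
res = np.full(arr.size // 4, 0.0)     # preallocated and touched once so pages are committed
method6(arr, res)                     # later calls reuse `res` and pay no page-fault cost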
Benchmark
method1: 3.64 ms
method2: 11.7 ms
method3: 16.0 ms
method4: 3.88 ms
method5: 2.05 ms
method6: 0.84 ms <----
This last method is nearly optimal and about 4.3 times faster than the previously fastest one, the BLAS-based method1.
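For completeness, here is a minimal sketch of how such timings can be measured (the input size and the exact timing setup are assumptions):

import numpy as np
from timeit import timeit

arr = np.random.rand(1_000_000 * 4)   # assumed input size
res = np.full(arr.size // 4, 0.0)     # preallocated output for method6

print('method4:', timeit(lambda: method4(arr), number=100) / 100, 's')
print('method5:', timeit(lambda: method5(arr), number=100) / 100, 's')
print('method6:', timeit(lambda: method6(arr, res), number=100) / 100, 's')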