method2
is the basic way to do that in Numpy. That being said, it is not very well optimized internally for such a case yet. Indeed, the reduction is done along a very small number of items, while the internal reduction code is optimized for a relatively large number of items. AFAIR, compilers like GCC tend to auto-vectorize the code using SIMD instructions, resulting in much slower execution for small reductions. It might be optimized in the future, but this is tricky to do since the problem is mainly related to the way compilers optimize the code and the assumptions they make during the optimization steps. Thus, it is not really a problem of Numpy, though there are ways to specifically optimize this use-case at the expense of less-maintainable code.
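For reference, here is a minimal sketch of what such a basic NumPy reduction typically looks like for summing groups of 4 consecutive items (the exact method2 of the question may differ, and the input size is just an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Sum each group of 4 consecutive items with a plain axis reduction
res = arr.reshape(-1, 4).sum(axis=1)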
method3
is not very efficient since np.add.reduceat
is currently not well optimized internally in Numpy yet. We plan to improve this, but one should not expect a drastic improvement since the method is fundamentally not very efficient on modern CPUs anyway.
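For reference, a sketch of such a reduceat-based approach (the exact method3 of the question may differ; the input size is an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Reduce over the segments starting at indices 0, 4, 8, ...
res = np.add.reduceat(arr, np.arange(0, arr.size, 4))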
method1
is clever because it makes use of BLAS, which is very optimized internally. The default implementation on most platforms, OpenBLAS, carefully optimizes many use-cases, including small matrix/vector multiplications, resulting in much faster execution. That being said, it is not optimal due to the unneeded multiplications by ones (BLAS does not optimize the computations based on the content of the values).
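For reference, a sketch of such a BLAS-based approach using a matrix-vector product with a vector of ones (the exact method1 of the question may differ; the input size is an assumption):

import numpy as np

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption

# Each row of the reshaped array is multiplied by 1 and summed,
# which dispatches to an optimized BLAS routine (e.g. OpenBLAS)
res = arr.reshape(-1, 4) @ np.ones(4)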
AFAIK, there is no way to write a faster implementation than method1
in pure Numpy. As a result, the only option left to speed up the code is to execute natively-compiled code specifically designed for your use-case. This is possible using Numba or Cython. Here is a naive implementation:
import numba as nb
import numpy as np

@nb.njit('(float64[::1],)')
def method4(arr):
    n = arr.size // 4  # number of groups of 4 consecutive items
    res = np.empty(n)
    for i in range(n):
        # Sum each group of 4 consecutive items
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
If you run this code, you will certainly get performance similar to BLAS, demonstrating how good BLAS implementations are (in fact, OpenBLAS is a bit faster on my machine). This code is not optimal because it is mainly memory-bound and page faults slow things down on most systems (see this related post). You can mitigate their overhead using multiple threads. This is still not optimal, as page faults do not scale well on all platforms (quite fine on Linux, but poor on Windows). Alternatively, you can preallocate the output array once so you pay this overhead only once. You can even mix both approaches depending on your needs (using multiple threads can be useful to ensure the memory bandwidth is saturated whatever the target platform, though creating threads can be expensive). Here are the naive parallel implementation and an optimized parallel implementation:
# Naive parallel implementation mitigating a bit the page-faults overhead
@nb.njit('(float64[::1],)', parallel=True)
def method5(arr):
    n = arr.size // 4
    res = np.empty(n)
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
# Parallel implementation avoiding page faults completely
# (assuming `res` is preallocated and filled)
@nb.njit('(float64[::1],float64[::1])', parallel=True)
def method6(arr, res):
    n = res.size
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
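For example, here is a possible way to use method6 with a preallocated output (the input size is just an assumption for illustration):

arr = np.random.rand(1_000_000 * 4)   # example input; the size is an assumption
res = np.full(arr.size // 4, 0.0)     # preallocated and touched once so pages are committed
method6(arr, res)                     # later calls reuse `res` and pay no page-fault cost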
Benchmark
method1: 3.64 ms
method2: 11.7 ms
method3: 16.0 ms
method4: 3.88 ms
method5: 2.05 ms
method6: 0.84 ms <----
This last method is nearly optimal and about 4.3 times faster than the previously fastest one, the BLAS-based method1.
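For completeness, here is a minimal sketch of how such timings can be measured (the input size and the exact timing setup are assumptions):

import numpy as np
from timeit import timeit

arr = np.random.rand(1_000_000 * 4)   # assumed input size
res = np.full(arr.size // 4, 0.0)     # preallocated output for method6

print('method4:', timeit(lambda: method4(arr), number=100) / 100, 's')
print('method5:', timeit(lambda: method5(arr), number=100) / 100, 's')
print('method6:', timeit(lambda: method6(arr, res), number=100) / 100, 's')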