Cumsum with restarts

Question

I want to bin the data every time the threshold 10000 is exceeded.

I have tried this with no luck:

# data which is an array of floats

diff = np.diff(np.cumsum(data)//10000, prepend=0)

indices = (np.argwhere(diff > 0)).flatten()

The problem is that all the bins does not contain 10000, which was my goal.

Expected output

input_data = [4000, 5000, 6000, 2000, 8000, 3000]
# (4000+5000+6000 >= 10000. Index 2)
# (2000+8000 >= 10000. Index 4)
Output: [2, 4]

I wonder if there is any alternative to a for loop?

yatu · Accepted Answer · 2020-06-12T15:56:13.167

Not sure how this could be vectorized, if it even can be, since by taking the cumulative sum you'll be propagating the remainders each time the threshold is surpassed. So probably this is a good case for numba, which will compile the code down to C level, allowing for a loopy but performant approach:

from numba import njit, int32

@njit('int32[:](int32[:], uintc)')
def windowed_cumsum(a, thr):
    indices = np.zeros(len(a), int32) 
    window = 0
    ix = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices[ix] = i
            ix += 1
            window = 0
    return indices[:ix]

The explicit signature implies ahead of time compilation, though this enforces specific dtypes on the input array. The inferred dtype for the example array is of int32, though if this might not always be the case or for a more flexible solution you can always ignore the dtypes in the signature, which will only imply that the function will be compiled on the first execution.

input_data = np.array([4000, 5000, 6000, 2000, 8000, 3000])

windowed_cumsum(input_data, 10000)
# array([2, 4])

Also @jdehesa raises an interesting point, which is that for very long arrays compared to the number of bins, a better option might be to just append the indices to a list. So here is an alternative approach using lists (also in no-python mode), along with timings under different scenarios:

from numba import njit, int32

@njit
def windowed_cumsum_list(a, thr):
    indices = []
    window = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices.append(i)
            window = 0
    return indices

a = np.random.randint(0,10,10_000)

%timeit windowed_cumsum(a, 20)
# 16.1 µs ± 232 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 20)
# 65.5 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit windowed_cumsum(a, 2000)
# 7.38 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 2000)
# 7.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So it seems that under most scenarios using numpy will be a faster option, since even in the second case, with a length 10000 array and a resulting array of 20 indices of bins, both perform similarly, though for memory efficiency reasons the latter might be more convenient in some cases.

I think Numba is a good fit for this problem, although I'm not sure why you need to maintain the whole array `window`, can't you just have the current single accumulated value and reset it after each bin is found? For `indices`, if the array is going to be very big in comparison to the number of bins, one could also consider just using a list and append. — jdehesa, Jun 12 '20 at 15:20
Initially had a slightly different approach, where it made sense to keep that. But totally didn't serve any purpose here, updated @jdehesa — yatu, Jun 12 '20 at 15:23
Was curious about how using a list would compare. Added some timings fyi @jdehesa — yatu, Jun 12 '20 at 15:41

jdehesa · Answer 2 · 2020-06-12T15:06:00.117

Here is how you can do it fairly efficiently with a loop, using np.searchsorted to find bin boundaries fast:

import numpy as np

np.random.seed(0)
bin_size = 10_000
data = np.random.randint(100, size=20_000)

# Naive solution (incorrect, for comparison)
data_f = np.floor(np.cumsum(data) / bin_size).astype(int)
bin_starts = np.r_[0, np.where(np.diff(data_f) > 0)[0] + 1]
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# We go over the limit!
print(bin_sums.max())
# 10080

# Better solution with loop
data_c = np.cumsum(data)
ref_val = 0
bin_starts = [0]
while True:
    # Search next split point
    ref_idx = bin_starts[-1]
    # Binary search through remaining cumsum
    next_idx = np.searchsorted(data_c[ref_idx:], ref_val + bin_size, side='right')
    next_idx += ref_idx
    # If we finished the array stop
    if next_idx >= len(data_c):
        break
    # Add new bin boundary
    bin_starts.append(next_idx)
    ref_val = data_c[next_idx - 1]
# Convert bin boundaries to array
bin_starts = np.array(bin_starts)
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# Does not go over limit
print(bin_sums.max())
# 10000

Any `cumsum` solution would not do what OP wants. So your `Naive solution` isn't exactly what's needed. Not sure about your vectorized either. — Quang Hoang, Jun 12 '20 at 15:05
@QuangHoang Yes, the naive solution is wrong, is what OP was doing, I clarified that now. I think the second solution is correct, although it's not vectorized, it just uses the cumsum to find the next bin boundary faster (admittedly, if using float data on a very big array cumsum may eventually result in too much error). It would actually be even faster to simply iterate through data until the next bin is found at C-like speed, should be easy to do that with Numba. — jdehesa, Jun 12 '20 at 15:11

Cumsum with restarts

2 Answers2

Linked