
Data

Let's take the following 2D array:

import numpy as np

starts = [0, 4, 10, 13, 23, 27]
ends = [4, 10, 13, 23, 27, 32]
lengths = [4, 6, 3, 10, 4, 5]

arr = np.array([starts, ends, lengths]).T

Thus looking like:

[[ 0  4  4]
 [ 4 10  6]
 [10 13  3]
 [13 23 10]
 [23 27  4]
 [27 32  5]]

Goal

Now I want to "loop" through the lengths and, as soon as the cumulative sum reaches 10, I want to output the starts and ends and then restart the cumulative counting.


Working code

tot_size = 0
start = 0

for i, numb in enumerate(arr[:,-1]):
    # update sum
    tot_size += numb

    # Check if target size is reached
    if tot_size >= 10:
        start_loc, end_loc = arr[start, 0], arr[i, 1]
        print('Start: {}\nEnd: {}\nSum: {}\n'.format(start_loc, end_loc, tot_size))
        start = i + 1
        tot_size = 0

# Last (possibly partial) chunk
start_loc, end_loc = arr[start, 0], arr[i, 1]
print('Start: {}\nEnd: {}\nSum: {}\n'.format(start_loc, end_loc, tot_size))

Which will print:

Start: 0
End: 10
Sum: 10

Start: 10
End: 23
Sum: 13

Start: 23
End: 32
Sum: 9

(I don't need to know the resulting sum but I do need to know the starts and ends)


Numpy try

I suppose there must be a much more straightforward, or a vectorized, way of doing this with numpy.

  • cumsum + remainder

I was thinking of something like np.remainder(np.cumsum(arr[:,-1]), 10); however, it is "hard" to tell when something is close to the target number (10 here), which is different from just splitting when the sum exceeds x (see the short sketch after this list).

  • stride_tricks

Since the above doesn't work within a window I thought of strides, but those windows are of fixed size.
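
For concreteness, here is what that remainder attempt produces on the example data (a quick illustration of the problem, not a solution):

cs = np.cumsum(arr[:, -1])    # [ 4 10 13 23 27 32]
rem = np.remainder(cs, 10)    # [ 4  0  3  3  7  2]

# The restarting loop above resets after rows 1 and 3 (and the final chunk ends
# at row 5), but the remainder alone does not reveal this: rows 2 and 3 both
# have remainder 3 even though the running total jumps from 13 to 23 between them.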

All ideas are welcome :)

CodeNoob
  • this is not possible to vectorize, but you can improve the speed, there are **many** examples on SO, [here is one](https://stackoverflow.com/questions/62346577/cumsum-with-restarts), [another](https://stackoverflow.com/questions/54208023/can-i-perform-dynamic-cumsum-of-rows-in-pandas) – mozway May 17 '22 at 07:40
  • You basically want to know which rows in the last column result in a new sum reaching 10? `np.unique(np.digitize(arr[:, -1].cumsum(), np.r_[10:arr[:,-1].cumsum().max():10]), return_index=True)[1][1:]` will give you the indices, right? – Kevin May 17 '22 at 08:21
  • @mozway, the approach of @kevin also returns the desired `[2,4]` for the question you linked. `digitize` looks cool! Let me check with some other data but it looks like it works :)) – CodeNoob May 17 '22 at 08:40
  • @CodeNoob, there are many ways of doing this which can absolutely be vectorized imo., for instance if `cs = arr[:, -1].cumsum()` you could do `np.arange(1,len(cs))[np.diff((cs // 10)).astype(bool)]` which would give the same answer as my comment above. Or, you could probably use searchsorted with `np.searchsorted(cs, [10,20,30])` or use modulus. – Kevin May 17 '22 at 16:08
  • @Kevin can you add your idea as an answer so I can accept that? I asked for a numpy solution out of curiosity, not necessarily for the most efficient/fastest one like the current answer – CodeNoob May 19 '22 at 15:32
  • @CodeNoob I noticed a small error with `np.searchsorted(cs, np.arange(10,cs[-1],10), side='right')` when testing on some larger examples. I believe the cumsum `cs` will be fundamentally wrong, since the remainder is propagated past the points where the threshold `10` is surpassed. I'd probably just go with the given solution :) – Kevin May 20 '22 at 21:47

1 Answer


Numpy is not designed to solve such a problem efficiently. You can still solve it using some tricks or the usual combination of cumsum + division + diff + where or similar ones (like @Kevin proposed), but AFAIK they are all inefficient: they require many temporary arrays and expensive operations.

Temporary arrays are expensive for two reasons: for small arrays, the overhead of a Numpy function call is typically several microseconds, generally adding up to dozens of microseconds for the whole operation; for big arrays, each operation is memory bound, and memory bandwidth is limited on modern platforms. It is actually even worse, since writing to a newly allocated array is much slower due to page faults, and Numpy array writes are currently not optimized on most platforms (including the mainstream x86-64 one).

As for "expensive operations", this includes sorting, which runs in O(n log n) (quick-sort is used by default) and is generally memory bound; finding the unique values (which currently does a sort internally); and integer division, which has always been known to be very slow.


One way to solve this problem is to use Numba (or Cython). Numba uses a just-in-time compiler to produce fast, optimized functions. It is especially useful for writing your own efficient Numpy-like built-ins. Here is an example based on your code:

import numba as nb
import numpy as np

@nb.njit(['(int32[:,:],)', '(int64[:,:],)'])
def compute(arr):
    n = len(arr)
    tot_size, start, cur = 0, 0, 0
    # Worst case: every row ends a chunk, hence n output slots
    slices = np.empty((n, 2), arr.dtype)

    for i in range(n):
        tot_size += arr[i, 2]

        # Chunk complete: record its start/end and reset the running sum
        if tot_size >= 10:
            slices[cur, 0] = arr[start, 0]
            slices[cur, 1] = arr[i, 1]
            start = i + 1
            cur += 1
            tot_size = 0

    # Last (possibly partial) chunk
    slices[cur, 0] = arr[start, 0]
    slices[cur, 1] = arr[i, 1]
    return slices[:cur+1]
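
Called on the array from the question, it returns the same start/end pairs as the loop above (a quick usage check; the result dtype simply follows arr.dtype):

compute(arr)
# array([[ 0, 10],
#        [10, 23],
#        [23, 32]])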

For your small example, the Numba function takes about 0.75 us on my machine, while the initial solution takes 3.5 us. In comparison, the Numpy solutions proposed by @Kevin in the comments (returning the indices) take 24 us for the np.unique-based one and 6 us for the division-based one. In fact, the basic np.cumsum alone already takes 0.65 us on my machine. Thus, the Numba solution is the fastest. This should be especially true for larger arrays.

Jérôme Richard