How to get a sorted cumulative array of values in numpy?

Question

I have the following numpy arrays (which are actually pandas columns) that represent observations (a position and a value):

df['x'] = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
df['y'] = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

And instead, I would like to get the following two arrays:

[1 2 3 4 5]
[4 3 2 3 2]

That is, I want to group all items with the same value in df['x'] and get the cumulative sum (i.e. the total) of the corresponding values in df['y']. In other words, I want the sum of the values for each individual position.

What is the most straightforward way to achieve that in numpy?

since they are in a dataframe, i think you can just do `df.groupby('x', as_index=False)['y'].sum()` — tdy, Feb 20 '22 at 22:51
Is there a reason why you don't want to use pandas for this? — Michael Butscher, Feb 20 '22 at 22:51
You can just export the result in numpy or you want to specifically do all of this in numpy? — Akmal Soliev, Feb 20 '22 at 22:52
I am curious to understand -for learning purposes- how this could be done -if there is such option- purely in numpy. — M.E., Feb 20 '22 at 22:55

Andras Deak -- Слава Україні · Accepted Answer · 2022-02-20T23:02:34.587

As others have noted in comments, if you're already using pandas it's probably a good idea to use a sum over groupby. That being said, if you insist on using raw NumPy you can find the unique values of x, along with the index of each element's value among them, and then sum up the corresponding values in y in an accumulator array:

import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)
res = np.zeros_like(vals, dtype=y.dtype)
np.add.at(res, inds, y)

print(res)
# [4 3 2 3 2]

vals holds the sorted unique values in x (the first array you asked for); beyond that it is only used here to size the result. inds is the key: for each element of x it gives the index of that value in vals. These are the positions in the result where we want to accumulate the corresponding values from y. The last trick is using np.add.at for an unbuffered summation, so that every occurrence of a repeated index is added.

The result is stored in res.
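To illustrate why the unbuffered np.add.at matters, here is a minimal sketch comparing it with plain fancy-index addition on the same data. The names buffered, unbuffered and counted are only illustrative, and the np.bincount line is an alternative not mentioned in the answer above, shown under the assumption that float output is acceptable.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)

# Buffered: with repeated indices, `res[inds] += y` does not accumulate
# every occurrence, so the group totals come out wrong.
buffered = np.zeros_like(vals, dtype=y.dtype)
buffered[inds] += y

# Unbuffered: np.add.at adds y[i] into position inds[i] for every i,
# including repeated positions.
unbuffered = np.zeros_like(vals, dtype=y.dtype)
np.add.at(unbuffered, inds, y)

# Alternative (not from the answer): bincount with weights also sums per
# index, but it returns a float64 array.
counted = np.bincount(inds, weights=y)

print(vals)        # [1 2 3 4 5]
print(unbuffered)  # [4 3 2 3 2]
```

Here vals and unbuffered together give the two arrays asked for in the question.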

score 1 · Answer 2 · answered Feb 21 '22 at 01:36

We can try

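import numpy as np

def groupby(a, b):
    # Stable sort by the keys in b so that equal keys become adjacent
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Boundaries of each run of equal keys (start of every group, plus the end)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Sum the values of a within each run
    out = [sum(a_sorted[i:j]) for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out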


groupby(df['y'].values,df['x'].values)
Out[223]: [4, 3, 2, 3, 2]

Note that the original function comes from Divakar's answer (thanks again, Divakar, for teaching me NumPy :-)).
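As a possible variation on the sort-and-cut idea above, the Python-level loop could be replaced by np.add.reduceat, which sums each slice between consecutive start indices. This is only a sketch: the function name group_sum_sorted is hypothetical, and it assumes the input arrays are one-dimensional and non-empty.

```python
import numpy as np

def group_sum_sorted(keys, values):
    """Sum `values` per distinct key; assumes non-empty 1-D arrays."""
    # Stable sort so that equal keys become adjacent
    order = np.argsort(keys, kind='stable')
    keys_sorted = keys[order]
    values_sorted = values[order]
    # Index where each run of equal keys starts
    starts = np.flatnonzero(np.r_[True, keys_sorted[1:] != keys_sorted[:-1]])
    # reduceat sums values_sorted[starts[k]:starts[k+1]] for each k,
    # and from the last start to the end of the array
    return keys_sorted[starts], np.add.reduceat(values_sorted, starts)

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

positions, totals = group_sum_sorted(x, y)
print(positions)  # [1 2 3 4 5]
print(totals)     # [4 3 2 3 2]
```

Unlike the list-returning version, this returns two NumPy arrays: the distinct positions and their totals.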
