Numpy: conditional sum

Question

I have the following numpy array:

import numpy as np
arr = np.array([[1,2,3,4,2000],
                [5,6,7,8,2000],
                [9,0,1,2,2001],
                [3,4,5,6,2001],
                [7,8,9,0,2002],
                [1,2,3,4,2002],
                [5,6,7,8,2003],
                [9,0,1,2,2003]
              ])

I understand np.sum(arr, axis=0) to provide the result:

array([   40,    28,    36,    34, 16012])

What I would like to do (without a for loop) is sum the columns grouped by the value in the last column, so that the result is:

array([[   6,    8,   10,   12, 4000],
       [  12,    4,    6,    8, 4002],
       [   8,   10,   12,    4, 4004],
       [  14,    6,    8,   10, 4006]])

I realize that it may be a stretch to do without a loop, but hoping for the best...

If a for loop must be used, then how would that work?

I tried np.sum(arr[:, 4]==2000, axis=0) (where I would substitute 2000 with the variable from the for loop), however it gave a result of 2
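A note on why that attempt returns 2: `arr[:, 4]==2000` is a boolean mask, so summing it counts the matching rows instead of adding them up. Below is a minimal sketch of the loop version, assuming the mask is used to select rows before summing. The names `count_2000`, `years` and `result` are illustrative and not from the question.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 2000],
                [5, 6, 7, 8, 2000],
                [9, 0, 1, 2, 2001],
                [3, 4, 5, 6, 2001],
                [7, 8, 9, 0, 2002],
                [1, 2, 3, 4, 2002],
                [5, 6, 7, 8, 2003],
                [9, 0, 1, 2, 2003]])

# Summing the mask itself only counts the rows whose year is 2000.
count_2000 = np.sum(arr[:, 4] == 2000)

# Using the mask to select rows first gives the per-year column sums.
years = np.unique(arr[:, 4])  # illustrative name: the distinct years, sorted
result = np.array([arr[arr[:, 4] == year].sum(axis=0) for year in years])
```

This still loops once per distinct year, so it is mainly useful as a readable baseline to check the vectorized answers against.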

Does the value in the right column always get repeated exactly twice, or is that just a coincidence in your example? — Mad Physicist, May 01 '18 at 18:47
@cᴏʟᴅsᴘᴇᴇᴅ Could you reopen please? I am working on a pure-numpy solution. — Mad Physicist, May 01 '18 at 18:51
@MadPhysicist Alright, no problems, I'd be interested to see that as well. — cs95, May 01 '18 at 18:52
coincidence (i basically have a lot of data that I want to sum by year). `df.groupby(4, axis=0).sum()` does give me exactly what I need. I will leave unanswered as I would like to know if same thing can be accomplished with numpy, but thanks! — Infinity Cliff, May 01 '18 at 18:54
@cᴏʟᴅsᴘᴇᴇᴅ. Thanks for that I posted an answer. — Mad Physicist, May 01 '18 at 19:07
@InfinityCliff though a `numpy`-only solution might be interesting, sometimes it is good not to reinvent the wheel and just use some library with a `groupby` function :) — rafaelc, May 01 '18 at 19:20

rafaelc · Answer 1 · 2018-05-01T18:59:22.640

I'm posting a simple solution with pandas and one with itertools

import pandas as pd
df = pd.DataFrame(arr)
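x = df.groupby(4).sum().reset_index()[range(5)]  # range(5) restores the original column order
x[4] *= 2  # the key column holds the year, not its sum; doubling matches only because each year occurs exactly twice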
np.array(x)

array([[   6,    8,   10,   12, 4000],
       [  12,    4,    6,    8, 4002],
       [   8,   10,   12,    4, 4004],
       [  14,    6,    8,   10, 4006]])

You can also use itertools:

import itertools
np.array([sum(x[1]) for x in itertools.groupby(arr, key = lambda k: k[-1])])

array([[   6,    8,   10,   12, 4000],
       [  12,    4,    6,    8, 4002],
       [   8,   10,   12,    4, 4004],
       [  14,    6,    8,   10, 4006]])
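One caveat worth spelling out: `itertools.groupby` only groups consecutive items with equal keys, so the snippet above relies on rows with the same year being adjacent. A minimal sketch for unsorted input, assuming it is acceptable to sort the rows by the last column first (the names `shuffled`, `ordered` and `grouped` are illustrative):

```python
import itertools
import numpy as np

# Same rows as in the question, but in shuffled order.
shuffled = np.array([[5, 6, 7, 8, 2003],
                     [9, 0, 1, 2, 2001],
                     [5, 6, 7, 8, 2000],
                     [9, 0, 1, 2, 2003],
                     [3, 4, 5, 6, 2001],
                     [1, 2, 3, 4, 2000],
                     [1, 2, 3, 4, 2002],
                     [7, 8, 9, 0, 2002]])

# Sort rows by the last column so that equal years become adjacent.
ordered = shuffled[shuffled[:, -1].argsort(kind="stable")]

# groupby now sees each year as one consecutive run of rows.
grouped = np.array([sum(rows) for _, rows in
                    itertools.groupby(ordered, key=lambda row: row[-1])])
```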

score 4 · Accepted Answer · answered May 01 '18 at 19:05

You can do this in pure numpy using a clever application of np.diff and np.add.reduceat, provided rows with the same value in the rightmost column are already adjacent. np.diff is non-zero wherever the rightmost column changes:

d = np.diff(arr[:, -1])

np.where will convert the non-zero entries of d into the integer indices that np.add.reduceat expects:

e = np.where(d)[0]

reduceat will also expect to see a zero index, and everything needs to be shifted by one:

indices = np.r_[0, e + 1]

Using np.r_ here is a bit more convenient than np.concatenate because it allows scalars. The sum then becomes:

result = np.add.reduceat(arr, indices, axis=0)

This can be combined into a one-liner of course:

>>> result = np.add.reduceat(arr, np.r_[0, np.where(np.diff(arr[:, -1]))[0] + 1], axis=0)
>>> result
array([[   6,    8,   10,   12, 4000],
       [  12,    4,    6,    8, 4002],
       [   8,   10,   12,    4, 4004],
       [  14,    6,    8,   10, 4006]])
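To make the intermediate steps concrete, here is the same computation for the array in the question, with the value of each step noted in a comment. This is only a worked sketch of the steps described above; the variable names `d`, `e` and `indices` follow the explanation.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 2000],
                [5, 6, 7, 8, 2000],
                [9, 0, 1, 2, 2001],
                [3, 4, 5, 6, 2001],
                [7, 8, 9, 0, 2002],
                [1, 2, 3, 4, 2002],
                [5, 6, 7, 8, 2003],
                [9, 0, 1, 2, 2003]])

d = np.diff(arr[:, -1])    # [0, 1, 0, 1, 0, 1, 0]: non-zero where the year changes
e = np.where(d)[0]         # [1, 3, 5]: last row index of each group except the final one
indices = np.r_[0, e + 1]  # [0, 2, 4, 6]: first row index of every group

# Sum rows from each start index up to (but not including) the next one.
result = np.add.reduceat(arr, indices, axis=0)
```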

Nice answer; Even though the one-liner is difficult to read, it's very well explained :) — rafaelc, May 01 '18 at 19:12
Thanks. I think @Divakar's answer is a more robust rendition of the same idea. — Mad Physicist, May 01 '18 at 20:24
choosing this one as the answer as it answers the question using only `numpy`, but truthfully I like the `pandas.groupby` by @rafaelc better, it will actually work better for my final solution as I also need to group by month and year. Thanks all. — Infinity Cliff, May 04 '18 at 20:38

Divakar · Answer 3 · 2018-05-01T19:28:59.627

Approach #1 : NumPy based sum-reduction

Here's one based on np.add.reduceat -

def groupbycol(a, assume_sorted_col=False, colID=-1):
    if not assume_sorted_col:
        # If a is not already sorted by that col, use argsort indices for
        # that colID and re-arrange rows accordingly
        sidx = a[:,colID].argsort()
        a_s = a[sidx] # sorted by colID col of input array
    else:
        a_s = a

    # Get group shifting indices
    cut_idx = np.flatnonzero(np.r_[True, a_s[1:,colID] != a_s[:-1,colID]])

    # Use those indices to setup sum reduction at intervals along first axis
    return np.add.reduceat(a_s, cut_idx, axis=0)

Sample run -

In [64]: arr
Out[64]: 
array([[   1,    2,    3,    4, 2000],
       [   5,    6,    7,    8, 2000],
       [   9,    0,    1,    2, 2001],
       [   3,    4,    5,    6, 2001],
       [   7,    8,    9,    0, 2002],
       [   1,    2,    3,    4, 2002],
       [   5,    6,    7,    8, 2003],
       [   9,    0,    1,    2, 2003]])

In [65]: # Shuffle rows of input array to create a generic last col (not sorted)
    ...: np.random.seed(0)
    ...: np.random.shuffle(arr)

In [66]: arr
Out[66]: 
array([[   5,    6,    7,    8, 2003],
       [   9,    0,    1,    2, 2001],
       [   5,    6,    7,    8, 2000],
       [   9,    0,    1,    2, 2003],
       [   3,    4,    5,    6, 2001],
       [   1,    2,    3,    4, 2000],
       [   1,    2,    3,    4, 2002],
       [   7,    8,    9,    0, 2002]])

In [67]: groupbycol(arr, assume_sorted_col=False, colID=-1)
Out[67]: 
array([[   6,    8,   10,   12, 4000],
       [  12,    4,    6,    8, 4002],
       [   8,   10,   12,    4, 4004],
       [  14,    6,    8,   10, 4006]])

Approach #2 : Leverage matrix-multiplication

We could replace that np.add.reduceat with a broadcasted mask creation followed by a matrix multiplication. The summation is then done by np.dot (BLAS-backed for floating-point arrays), and this also works for a generic, not-sorted column -

import pandas as pd

def groupbycol_matmul(a, colID=-1):
    # Compare each unique key against the key column of the input a (not the global arr)
    mask = pd.Series(a[:,colID]).unique()[:,None] == a[:,colID]
    return mask.dot(a)
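To show what the mask does, here is a NumPy-only sketch of the same idea. It swaps `pd.Series(...).unique()` for `np.unique`, which is an assumption made here to avoid the pandas dependency: `np.unique` returns the keys sorted, whereas the pandas call keeps them in order of first appearance. The function name `groupbycol_matmul_np` is hypothetical.

```python
import numpy as np

def groupbycol_matmul_np(a, colID=-1):  # hypothetical name, not from the answer
    # Sorted unique keys of the grouping column.
    keys = np.unique(a[:, colID])
    # mask[i, j] is True when row j of a belongs to group i.
    mask = keys[:, None] == a[:, colID]   # shape (n_groups, n_rows)
    # Each output row is the sum of the input rows selected by one mask row.
    return mask.astype(a.dtype).dot(a)

# Shuffled rows from the sample run above.
shuffled = np.array([[5, 6, 7, 8, 2003],
                     [9, 0, 1, 2, 2001],
                     [5, 6, 7, 8, 2000],
                     [9, 0, 1, 2, 2003],
                     [3, 4, 5, 6, 2001],
                     [1, 2, 3, 4, 2000],
                     [1, 2, 3, 4, 2002],
                     [7, 8, 9, 0, 2002]])

out = groupbycol_matmul_np(shuffled)
```

Note that the mask has one entry per (group, row) pair, so memory grows with the number of groups times the number of rows.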

Good call on doing the argsort first. — Mad Physicist, May 01 '18 at 19:08
Wish I could give another +1 for the multiplication. — Mad Physicist, May 01 '18 at 20:27

Jacques Gaudin · Answer 4 · 2018-05-01T19:23:37.507

You may want to have a look at numpy_indexed. With it you can do:

import numpy as np
import numpy_indexed as npi

arr = np.array([[1,2,3,4,2000],
                [5,6,7,8,2000],
                [9,0,1,2,2001],
                [3,4,5,6,2001],
                [7,8,9,0,2002],
                [1,2,3,4,2002],
                [5,6,7,8,2003],
                [9,0,1,2,2003]
              ])


result = npi.GroupBy(arr[:, 4]).sum(arr)[1]

>>>[[   6    8   10   12 4000]
    [  12    4    6    8 4002]
    [   8   10   12    4 4004]
    [  14    6    8   10 4006]]
