
Is there any function to do a groupby sum on a numpy array?

maybe duplicated with this

x = np.array([[1.2, 10],
              [2.3, 20],
              [1.2, 30],
              [2.3, 7]
            ])

wanted output:

x = np.array([[1.2, 40],
              [2.3, 27]            
            ])

Update:

Actually, the first column of my data is always rounded to two decimals, so after scaling by 100 it can be treated as integer keys. x can then be written as:

x = np.array([[120, 10],
              [230, 20],
              [120, 30],
              [230, 7]
            ])
Ju Piece
    What is that you have tried ? – sushanth Sep 05 '20 at 03:36
  • Be aware that values coming from real data are very rarely equal to each other, so it's actually extremely unlikely for you to have duplicate floating-point keys. Depending on what this is for, you may want to use something more like a histogram instead. – abatea Sep 05 '20 at 04:21

4 Answers


I wouldn't say this is a duplicate, but the related question you mentioned is a good starting point. Most of the answers under that link require you to sort the array, extract the indices where each group begins and then call np.split on them. That's not what you want here, because it returns a list of groups of unequal size, which you would then have to loop over.

Instead, you can use np.bincount. With the weights argument it sums the weights of all occurrences of each index, which is exactly a groupby sum; the only thing missing from its output is the group keys.

def group_by_sum(x):
    # map each key in the first column to a group index 0, 1, 2, ...
    u, idx = np.unique(x[:, 0], return_inverse=True)
    # sum the second column within each group
    s = np.bincount(idx, weights=x[:, 1])
    return np.c_[u, s]
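
For the sample array from the question, this gives the wanted output (a quick check using the group_by_sum function above):

x = np.array([[1.2, 10],
              [2.3, 20],
              [1.2, 30],
              [2.3, 7]])

group_by_sum(x)
# array([[ 1.2, 40. ],
#        [ 2.3, 27. ]])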

Bonus: it's actually a one-liner with the numpy_indexed package:

import numpy_indexed as npi

np.transpose(npi.group_by(x[:, 0]).sum(x[:, 1]))

Benchmarking

import numpy as np
import perfplot
import matplotlib.pyplot as plt

def bincount(x):
    u, idx = np.unique(x[:,0], return_inverse=True)
    s = np.bincount(idx, weights = x[:,1])
    return np.c_[u, s]

def reduceat(x):
    x = x[np.argsort(x[:, 0])]
    i = np.flatnonzero(np.diff(x[:, 0]))
    i = np.r_[0, i + 1]
    s = np.add.reduceat(x[:, 1], i)
    return np.stack((x[i, 0], s), axis=-1)

def setup(N, s):
    x = np.linspace(0,1,N+1)[np.random.randint(N, size = s)]
    return np.c_[x, (x**2)%1]

def build_args(k):
    return {'setup': lambda x: setup(k, x),
            'kernels': [bincount, reduceat],
            'n_range': [2**k for k in range(1, 20)],
            'title': f'Testing for x samples in [0, 1] with no more than {k} groups',
            'show_progress': True,
            'equality_check': False}

outs = [perfplot.bench(**build_args(n)) for n in (10, 100, 1000, 10000)]
fig = plt.figure(figsize=(20, 20))
for i in range(len(outs)):
    ax = fig.add_subplot(2, 2, i + 1)
    ax.grid(True, which="both")
    outs[i].plot()
plt.show()

[Benchmark plots: bincount vs. reduceat for samples in [0, 1] with no more than 10, 100, 1000 and 10000 groups]

mathfux
  • You're welcome. I don't know how this method works internally, but `np.add.reduceat` is a nice feature too. – mathfux Sep 05 '20 at 06:02
  • `unique` works internally by sorting, so both methods have the same complexity. Do you have the resources to benchmark them against each other? – Mad Physicist Sep 05 '20 at 13:40
  • @Mad Physicist. Good catch. It's interesting to me too, but I'm busy with another problem at the moment, so I'll benchmark it in the coming days and update the answer. – mathfux Sep 05 '20 at 13:45
  • They are pretty much the same. – mathfux Sep 07 '20 at 05:00

Here is a solution that uses the unique values to count the repetitions of each element and multiplies each element's value by its count to get the groupby sum (you could do this faster with an O(n) hash map that only keeps the counts and the unique values):

EDIT, since the original question was edited:

# unique keys, index of the first occurrence of each key, and per-key counts
keys2, idx, count = np.unique(x[:,0], return_counts=True, return_index=True)
# multiply the first value seen for each key by that key's count
values2 = x[:,1][idx]*count

Another way is using pandas groupby:

import pandas as pd

df = pd.DataFrame({'keys': x[:,0], 'values': x[:,1]})
df2 = df.groupby('keys')['values'].agg('sum')
keys2, values2 = df2.index.to_numpy(), df2.values

output:

[1.2 2.3] 
[20 30]
Ehsan
  • @MadPhysicist You mean `return_index`? that makes sense. Thank you. Edited the post. – Ehsan Sep 05 '20 at 08:53
  • @MadPhysicist I see. Thank you for correcting me. For this specific example, however, we can still accomplish the task with a hashmap with `O(n)` without the need to know which elements are the same. Only counts and the values are needed. – Ehsan Sep 09 '20 at 01:50

Numpy provides the tools to do this without explicit looping.

First sort the rows:

a = a[np.argsort(a[:, 0])]

Then find the indices where the value changes:

i = np.flatnonzero(np.diff(a[:, 0]))
i = np.r_[0, i + 1]

Then add up the elements:

s = np.add.reduceat(a[:, 1], i)

The key of each group is just the first element of a in each run, so the result is

result = np.stack((a[i, 0], s), axis=-1)
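
Put together, here is a minimal sketch of the whole pipeline applied to the sample array from the question, with the intermediate values shown as comments:

import numpy as np

a = np.array([[1.2, 10],
              [2.3, 20],
              [1.2, 30],
              [2.3, 7]])

a = a[np.argsort(a[:, 0])]            # rows sorted by key: 1.2, 1.2, 2.3, 2.3
i = np.flatnonzero(np.diff(a[:, 0]))  # positions where the key changes: [1]
i = np.r_[0, i + 1]                   # start index of each run: [0, 2]
s = np.add.reduceat(a[:, 1], i)       # per-run sums: [40., 27.]
result = np.stack((a[i, 0], s), axis=-1)
# array([[ 1.2, 40. ],
#        [ 2.3, 27. ]])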
Mad Physicist
  • Is sorting necessary for this problem? I think it is costlier than it needs to be. – Ehsan Sep 05 '20 at 08:54
  • @Ehsan. Yes. That's how `unique` works internally. Otherwise you won't know which elements are the same. It's more complex than a hash table, but I suspect that it's faster than a python `set` for most practically-sized applications. – Mad Physicist Sep 05 '20 at 13:41
  • Thank you. Is it necessary to know which elements are the same? Isn't count and the unique values enough (like I used in my post)? – Ehsan Sep 09 '20 at 01:52

Here's a method using a plain Python dictionary:

d = {}
for k,v in x:
    d[k] = d.get(k,0) + v

x = np.array(list(d.items()))

keep in mind that this is testing for float equality... something you probably shouldn't do
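
For the question's array, a quick check of this approach (note that the row order of the result follows the insertion order of the dict):

x = np.array([[1.2, 10],
              [2.3, 20],
              [1.2, 30],
              [2.3, 7]])

d = {}
for k, v in x:
    d[k] = d.get(k, 0) + v

np.array(list(d.items()))
# array([[ 1.2, 40. ],
#        [ 2.3, 27. ]])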

abatea