
I'm trying to add float values of a vector according to integer values from another vector.

For instance, if I have:

import numpy as np
a = np.array([0.1,0.2,0.3,0.4,0.5,0.6,7.3,0.8,0.9,1.,1.2,1.4])
b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)

I would like to add the first 5 values of the a vector together (because the first 5 values of b are 0), the next 3 values together (because the next 3 values of b are 1), and so on. So at the end I would expect to have:

c = function(a,b)
c = [0.1+0.2+0.3+0.4+0.5,  0.6+7.3+0.8, 0.9+1.+1.2+1.4]

5 Answers


For a pure NumPy solution, you can check np.diff() of b, which gives you a new array that is zero everywhere except where the values change. However, this needs one small tweak, since np.diff() reduces the size of your array by one element, so your indices will be off by one. There is actually ongoing development in NumPy to make this nicer (new arguments to pad the output back to the original size; see the issue here: https://github.com/numpy/numpy/issues/8132).

With that said...here's something that should be instructive:

In [100]: a
Out[100]: array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1. , 1.2, 1.4])

In [101]: b
Out[101]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

In [102]: np.diff(b) # note it is one element shorter than b
Out[102]: array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

In [103]: np.flatnonzero(np.diff(b))
Out[103]: array([4, 7]) 

In [104]: np.flatnonzero(np.diff(b)) + 1
Out[104]: array([5, 8])

In [105]: np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
Out[105]: array([0, 5, 8]) # these are the indices of the start of each group

In [106]: indices = _

In [107]: np.add.reduceat(a, indices)
Out[107]: array([1.5, 8.7, 4.5])

In [108]: def sumatchanges(a, b):
     ...:     indices = np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
     ...:     return np.add.reduceat(a, indices)
     ...:

In [109]: sumatchanges(a, b)
Out[109]: array([1.5, 8.7, 4.5])

In most settings I would definitely prefer using Pandas groupby as in jpp's answer, as this is ugly. Hopefully, with those changes to NumPy, this could look a bit nicer and more natural in the future.
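
As a rough sketch of what that could look like (assuming NumPy >= 1.16, where np.diff gained prepend/append arguments; the helper name here is just illustrative):

def sumatchanges_prepend(a, b):
    # Prepending b[0] - 1 forces a nonzero diff at position 0, so index 0 is
    # picked up as a group start without needing a separate np.insert call.
    starts = np.flatnonzero(np.diff(b, prepend=b[0] - 1))
    return np.add.reduceat(a, starts)

On the example arrays this returns array([1.5, 8.7, 4.5]), same as sumatchanges above.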


Note that this answer is equivalent (in output) to the itertools.groupby answer that Maarten gave. Specifically, the groups are assumed to be sequential. I.e., this

b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)

would produce the same output as with

b = np.array([0,0,0,0,0,1,1,1,0,0,0,0]).astype(int)

The number is irrelevant, so long as it changes. However, the other solution Maarten gave, as well as the pandas solution by jpp, will sum all the elements with the same label, regardless of location. The OP is not clear on which behavior is preferred.


Timing:

Here I'll create a random array for summing and a random array of increasing values with 100k entries each, and time both functions:

In [115]: import timeit
In [116]: import pandas as pd

In [117]: def sumatchangespd(a, b):
     ...:     return pd.Series(a).groupby(b).sum().values
     ...:

In [125]: l = 100_000

In [126]: a = np.random.rand(l)

In [127]: b = np.cumsum(np.random.randint(2, size=l))

In [128]: sumatchanges(a, b)
Out[128]:
array([2.83528234e-01, 6.66182064e-01, 9.32624292e-01, ...,
       2.98379765e-01, 1.97586484e+00, 8.65103445e-04])

In [129]: %timeit sumatchanges(a, b)
1.91 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [130]: %timeit sumatchangespd(a, b)
6.33 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also just to make sure these are equivalent:

In [139]: all(np.isclose(sumatchanges(a, b), sumatchangespd(a, b)))
Out[139]: True

So the numpy version is faster (not too surprising). Again, these functions could do slightly different things though, depending on your input:

In [120]: b  # numpy solution grabs each chunk as a separate piece
Out[120]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

In [121]: b[-4:] = 0

In [122]: b   # pandas will sum the vals in a that have same vals in b
Out[122]: array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

In [123]: sumatchanges(a, b)
Out[123]: array([1.5, 8.7, 4.5])

In [124]: sumatchangespd(a, b)
Out[124]: array([6. , 8.7])

Divakar's main solution is brilliant and the best out of all of the above speed-wise:

In [144]: def sumatchangesbc(a, b):
     ...:     return np.bincount(b,a)
     ...:

In [145]: %timeit sumatchangesbc(a, b)
175 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Order of magnitude faster than my numpy solution.

alkasm
  • Great answer. Is this actually more efficient than Pandas? I expect it should be as it doesn't have the overhead with `pd.Series` objects. – jpp Oct 03 '18 at 10:06
  • I was going to time it and see, I'm curious too. `np.insert()` is not a cheap operation though. – alkasm Oct 03 '18 at 10:08
  • @jpp Added the timings---numpy is faster. I actually find myself doing this a lot (detecting state changes via `df['col'].diff()` or `np.diff()`). I like pandas better for it currently because it keeps the same length and you can specify the padding value and such, but since numpy will have it soon, I suppose I'll be fine with either! – alkasm Oct 03 '18 at 10:15
  • Can you try on a larger array (e.g. 100,000 random values)? When timings are in microseconds, I feel they get distorted by fixed costs. (And Pandas certainly has more fixed costs.) – jpp Oct 03 '18 at 10:16
  • @jpp agreed and fair point, edited with some new arrays. About 3.5x speedup, which seems more reasonable. – alkasm Oct 03 '18 at 10:22

Approach #1: We can make use of np.bincount with b as the bins and a as the weights array -

In [203]: np.bincount(b,a)
Out[203]: array([1.5, 8.7, 4.5])
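
For intuition, here is a plain-Python sketch of what np.bincount(b, a) computes (illustrative only; NumPy does this in compiled code). Note that, like the pandas solution, it sums by label value rather than by consecutive runs:

out = [0.0] * (b.max() + 1)      # one accumulator per label value in b
for label, weight in zip(b, a):
    out[label] += weight         # each element of a is added to the bin given by b
# out -> [1.5, 8.7, 4.5]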

Approach #2: Another one leveraging matrix multiplication -

In [210]: (b == np.arange(b.max()+1)[:,None]).dot(a)
Out[210]: array([1.5, 8.7, 4.5])
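
Step by step, that one-liner builds a one-hot label matrix via broadcasting and then dots it with a (a small sketch for intuition):

labels = np.arange(b.max() + 1)    # array([0, 1, 2])
onehot = (b == labels[:, None])    # shape (3, 12); row i is True where b == i
c = onehot.dot(a)                  # each boolean row selects and sums its share of a
# c -> array([1.5, 8.7, 4.5])
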
Divakar
  • Where do you pull these things from? `bincount`? Marvelous! The matrix multiplication one would be a bad idea for large arrays though. But `bincount` is an order of magnitude faster than my solution. – alkasm Oct 03 '18 at 10:29
  • Yes that's great. In addition, this is really fast! Thanks – ymmx Oct 03 '18 at 11:25

You can use Pandas, which is built on NumPy:

import pandas as pd

c = pd.Series(a).groupby(b).sum().values

# array([ 1.5,  8.7,  4.5])

Or a more verbose alternative:

c = pd.DataFrame({'a': a, 'b': b})\
      .groupby('b')['a'].sum().values
jpp

With numpy only

c = [sum(a[b==i]) for i in sorted(set(b))]

Note: as @jpp pointed out, it's probably better to write np.unique instead of sorted(set(b)).
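
Folding that in (and using the array's own .sum method rather than the built-in sum, per jpp's comment below), the same idea could be written as:

c = np.array([a[b == i].sum() for i in np.unique(b)])
# array([1.5, 8.7, 4.5])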

blue_note
  • If `b = np.array([2,2,0,0,0,1,1,1,2,2,2,2]).astype(int)` then the output will be wrong, right? – Rahul K P Oct 03 '18 at 09:32
  • @RahulKP: Why? What would be the right one? His example has consecutive, ordered values; he does not specify what should happen otherwise. – blue_note Oct 03 '18 at 09:35
  • He specified that consecutive values should be added together. In your case, you are not considering the consecutive values. – Rahul K P Oct 03 '18 at 09:40
  • See [this answer](https://stackoverflow.com/a/50399219/9209546) for why you should not use built-ins like `sum` with NumPy arrays. – jpp Oct 03 '18 at 10:04

A native Python solution using itertools.groupby:

from itertools import groupby

groups = (group for key, group in groupby(zip(a, b), key=lambda x: x[1]))
totals = [sum(a for a, b in group) for group in groups]
[1.5, 8.7, 4.5]

A numpy alternative, similar to @blue_note's solution, but using the power of numpy instead of native Python:

[(a * (b == group)).sum() for group in np.unique(b)]

This only works if b does not contain distinct groups sharing the same label, as in b = np.array([2,2,0,0,0,1,1,1,2,2,2,2]), where the label 2 appears in two separate runs.
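
To make that caveat concrete, with the a from the question the two snippets above give different results for such an input (sums worked out by hand, shown approximately):

b2 = np.array([2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
# itertools.groupby version: one sum per consecutive run -> [0.3, 1.2, 8.7, 4.5]
# label-based version:       one sum per distinct label  -> [1.2, 8.7, 4.8]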

Maarten Fabré
  • See [this answer](https://stackoverflow.com/a/50399219/9209546) for why you should not use built-ins like `sum` / `zip` with NumPy arrays. – jpp Oct 03 '18 at 10:05
  • The numpy alternative uses `numpy.array.sum`, not the built-in – Maarten Fabré Oct 03 '18 at 10:05
  • Yep, that's true, though it now has O(*m* * *n*) complexity, where *m* is the number of ids and *n* is the number of rows of values. That said, I hope my comment encourages people to use the NumPy alternative; that was certainly the intention. – jpp Oct 03 '18 at 10:07