
Suppose there is an array with outcomes and an array with probabilities. Some outcomes may be listed multiple times. For example:

import numpy as np
x = np.array(([0,0],[1,1],[2,1],[1,1],[2,2]),dtype=int)
p = np.array([0.1,0.2,0.3,0.1,0.2],dtype=float)

Now I would like to list the unique outcomes in x and add up the corresponding probabilities in p of the duplicate outcomes. So the result should be arrays xnew and pnew defined as

xnew = np.array(([0,0],[1,1],[2,1],[2,2]),dtype=int)
pnew = np.array([0.1,0.3,0.3,0.2],dtype=float)

While there are some examples of how to obtain unique rows (see, e.g., Removing duplicate columns and rows from a NumPy 2D array), it is unclear to me how to use this to add up the values in the other array.

Anyone have a suggestion? Solutions using numpy are preferred.


3 Answers


bincount can sum the p array for you; you just need to create a unique id number for every unique row in a. If you're using a sorting approach to identify the unique rows, then creating a unique id is really easy: once you have sorted the rows and generated a diff array, you can just cumsum the diff array. For example:

  x    diff cumsum
[0, 0]  1    1
[0, 0]  0    1
[0, 1]  1    2
[0, 2]  1    3
[1, 0]  1    4
[1, 0]  0    4
[1, 0]  0    4
[1, 0]  0    4
[1, 0]  0    4
[1, 1]  1    5

In code, it looks like this:

import numpy as np

def unique_rows(a, p):
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.ones(len(a), 'bool')
    diff[1:] = (a[1:] != a[:-1]).any(-1)
    sums = np.bincount(diff.cumsum() - 1, p[order])
    return a[diff], sums
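Applied to the arrays from the question, this approach gives the expected grouped result. A self-contained sketch (the function body is the one above, repeated here so the snippet runs on its own):

```python
import numpy as np

def unique_rows(a, p):
    # Sort rows lexicographically, flag the first row of each group of
    # equal rows, and sum each group's weights with bincount.
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.ones(len(a), 'bool')
    diff[1:] = (a[1:] != a[:-1]).any(-1)
    sums = np.bincount(diff.cumsum() - 1, p[order])
    return a[diff], sums

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

xnew, pnew = unique_rows(x, p)
print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```

Note that lexsort takes the last row of its key array as the primary key, so the rows come out sorted by the last column first; for this data that coincides with row-wise lexicographic order.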

This is a typical grouping problem, which can be solved in a fully vectorized manner using the numpy_indexed package (disclosure: I am its author):

import numpy_indexed as npi
xnew, pnew = npi.group_by(x).sum(p)
  • Nice package. This functionaility could be useful to include in numpy itself. – Forzaa Apr 04 '16 at 11:53
  • Thanks; I initially wrote it as a numpy enhancement proposal, but grappling with all the backwards-compatibility questions would probably take longer than I'd be willing to wait. I do hope this functionality gets backported to numpy eventually though. – Eelco Hoogendoorn Apr 04 '16 at 12:01
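As a side note on the comment thread above: more recent NumPy versions (1.13 and later) can do this grouping with np.unique itself, via its axis keyword together with return_inverse and np.bincount. A minimal sketch on the question's arrays:

```python
import numpy as np

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

# Unique rows plus, for every original row, the index of its unique row.
xnew, inverse = np.unique(x, axis=0, return_inverse=True)

# Sum the probabilities that map onto the same unique row.
# (.ravel() guards against versions where inverse is not 1-D.)
pnew = np.bincount(inverse.ravel(), weights=p)

print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```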

Not using numpy, but collecting similar values can be done with a dictionary:

import numpy as np
x = np.array(([0,0],[1,1],[2,1],[1,1],[2,2]),dtype=int)
p = np.array([0.1,0.2,0.3,0.1,0.2],dtype=float)

#Initialise dictionary
pdict = {}
for i in x:
    pdict[str(i)] = []

#Collect same values using keys
for i in range(x.shape[0]):
    pdict[str(x[i])].append(p[i])

#Sum over keys
xnew = []; pnew = []
for key, val in pdict.items():
    xnew.append(key)
    pnew.append(np.sum(val))

print('xnew = ',np.array(xnew))
print('pnew = ',np.array(pnew))

I've left the xnew values as strings, which could be converted back to lists with some form of split.
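The string keys (and the conversion back) can be avoided entirely by keying on tuples instead, since tuples are hashable while numpy rows are not. A short sketch of that variant using collections.defaultdict:

```python
import numpy as np
from collections import defaultdict

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

# Accumulate probabilities under tuple keys in a single pass.
pdict = defaultdict(float)
for row, prob in zip(x, p):
    pdict[tuple(row)] += prob

xnew = np.array(list(pdict))           # keys, in first-seen order (Python 3.7+)
pnew = np.array(list(pdict.values()))

print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```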

  • This is similar to the solution that I already use and I'm trying to get an alternative for. I'm using defaultdict for this. `pdict = defaultdict(float)` and in the for loop set `pdict[tuple(x[i])] += p[i]`. However, this isn't a vectorized operation, which is preferred for me. – Forzaa Mar 24 '15 at 09:15