
Suppose there is an array with outcomes and an array with probabilities. Some outcomes may be listed multiple times. For example:

import numpy as np
x = np.array(([0,0],[1,1],[2,1],[1,1],[2,2]),dtype=int)
p = np.array([0.1,0.2,0.3,0.1,0.2],dtype=float)

Now I would like to list the unique outcomes in x and add up the corresponding probabilities in p of the duplicate outcomes. So the result should be arrays xnew and pnew defined as

xnew = np.array(([0,0],[1,1],[2,1],[2,2]),dtype=int)
pnew = np.array([0.1,0.3,0.3,0.2],dtype=float)

While there are some examples of how to obtain unique rows (see, e.g., Removing duplicate columns and rows from a NumPy 2D array), it is unclear to me how to use this to add up the values in the other array.

Anyone have a suggestion? Solutions using numpy are preferred.


3 Answers


bincount can sum the p array for you; you just need to create a unique id number for every unique row in a. If you're using a sorting approach to identify the unique rows, then creating a unique id is really easy: once you have sorted the rows and generated a diff array, you can just cumsum the diff array. For example:

  x    diff cumsum
[0, 0]  1    1
[0, 0]  0    1
[0, 1]  1    2
[0, 2]  1    3
[1, 0]  1    4
[1, 0]  0    4
[1, 0]  0    4
[1, 0]  0    4
[1, 0]  0    4
[1, 1]  1    5

In code, it looks like this:

import numpy as np

def unique_rows(a, p):
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.ones(len(a), 'bool')
    diff[1:] = (a[1:] != a[:-1]).any(-1)
    sums = np.bincount(diff.cumsum() - 1, p[order])
    return a[diff], sums
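Applied to the arrays from the question, this approach gives the expected grouped result. A self-contained sketch (the function body is the one above, repeated here so the snippet runs on its own):

```python
import numpy as np

def unique_rows(a, p):
    # Sort rows lexicographically, flag the first row of each group of
    # equal rows, and sum each group's weights with bincount.
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.ones(len(a), 'bool')
    diff[1:] = (a[1:] != a[:-1]).any(-1)
    sums = np.bincount(diff.cumsum() - 1, p[order])
    return a[diff], sums

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

xnew, pnew = unique_rows(x, p)
print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```

Note that lexsort takes the last row of its key array as the primary key, so the rows come out sorted by the last column first; for this data that coincides with row-wise lexicographic order.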

This is a typical grouping problem, which can be solved in a fully vectorized manner using the numpy_indexed package (disclosure: I am its author):

import numpy_indexed as npi
xnew, pnew = npi.group_by(x).sum(p)
  • Nice package. This functionaility could be useful to include in numpy itself. – Forzaa Apr 04 '16 at 11:53
  • Thanks; I initially wrote it as a numpy enhancement proposal, but grappling with all the backwards-compatibility questions would probably take longer than I'd be willing to wait. I do hope this functionality gets backported to numpy eventually though. – Eelco Hoogendoorn Apr 04 '16 at 12:01
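As a side note on the comment thread above: more recent NumPy versions (1.13 and later) can do this grouping with np.unique itself, via its axis keyword together with return_inverse and np.bincount. A minimal sketch on the question's arrays:

```python
import numpy as np

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

# Unique rows plus, for every original row, the index of its unique row.
xnew, inverse = np.unique(x, axis=0, return_inverse=True)

# Sum the probabilities that map onto the same unique row.
# (.ravel() guards against versions where inverse is not 1-D.)
pnew = np.bincount(inverse.ravel(), weights=p)

print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```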

Not using numpy, but collecting similar values can be done with a dictionary:

import numpy as np
x = np.array(([0,0],[1,1],[2,1],[1,1],[2,2]),dtype=int)
p = np.array([0.1,0.2,0.3,0.1,0.2],dtype=float)

#Initialise dictionary
pdict = {}
for i in x:
    pdict[str(i)] = []

#Collect same values using keys
for i in range(x.shape[0]):
    pdict[str(x[i])].append(p[i])

#Sum over keys
xnew = []; pnew = []
for key, val in pdict.items():
    xnew.append(key)
    pnew.append(np.sum(val))

print('xnew = ',np.array(xnew))
print('pnew = ',np.array(pnew))

I've left the xnew values as strings, which could be converted back to lists with some form of split.
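The string keys (and the conversion back) can be avoided entirely by keying on tuples instead, since tuples are hashable while numpy rows are not. A short sketch of that variant using collections.defaultdict:

```python
import numpy as np
from collections import defaultdict

x = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [2, 2]], dtype=int)
p = np.array([0.1, 0.2, 0.3, 0.1, 0.2], dtype=float)

# Accumulate probabilities under tuple keys in a single pass.
pdict = defaultdict(float)
for row, prob in zip(x, p):
    pdict[tuple(row)] += prob

xnew = np.array(list(pdict))           # keys, in first-seen order (Python 3.7+)
pnew = np.array(list(pdict.values()))

print(xnew)  # [[0 0] [1 1] [2 1] [2 2]]
print(pnew)  # [0.1 0.3 0.3 0.2]
```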

  • This is similar to the solution that I already use and I'm trying to get an alternative for. I'm using defaultdict for this. `pdict = defaultdict(float)` and in the for loop set `pdict[tuple(x[i])] += p[i]`. However, this isn't a vectorized operation, which is preferred for me. – Forzaa Mar 24 '15 at 09:15