
I have a 2d numpy array of bools, and I'd like to know how many unique rows my data set contains and the frequency of each row. The only way I could solve this problem is by converting my whole data set into a string and then do the comparison, but surely there must be a better way to do this. Any help is appreciated.

def getUniqueHaplotypes(self, data):
    nHap = data.shape[0]
    unique = dict()
    for i in range(nHap):
        s = "".join([str(j) for j in data[i]])
        if unique.has_key(s):
            unique[s] += 1
        else:
            unique[s] = 1
    return unique
  • I don't think that your way is so bad, though I would use tuples of the rows as keys rather than converting the rows to strings. That said, I think that Joe Kington's method is quite good. – Justin Peel Oct 13 '10 at 04:19
  • I'll second what Justin said: There's nothing wrong with the way you're already doing things. In fact, if you use tuples as Justin suggested and iterate directly over the rows of the array (`for row in data:`), it's actually faster than my method below. – Joe Kington Oct 13 '10 at 16:25
  • You can get a lot of good ideas for solutions from http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array – j08lue Dec 02 '16 at 12:25

2 Answers


Look into numpy.unique and numpy.bincount.

E.g.

import numpy as np
x = (np.random.random(100) * 5).astype(int)
unique_vals, indices = np.unique(x, return_inverse=True)
counts = np.bincount(indices)

print unique_vals, counts

Edit: Sorry, I misread your question...

One way to get the unique rows is to view things as a structured array...

In your case, you have a 2D array of bools. So maybe something like this?

import numpy as np
numrows, numcols = 10,3
x = np.random.random((numrows, numcols)) > 0.5
x = x.view(','.join(numcols * ['i1'])) # <- View the rows as a 1D structured array...

unique_vals, indices = np.unique(x, return_inverse=True)
counts = np.bincount(indices)

print unique_vals, counts
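One wrinkle with the structured-array trick: the `unique_vals` it produces are 1-D records, not bool rows. A minimal sketch of recovering ordinary rows (same setup as above; the view-back step is an added assumption that the records keep the same `'i1'` layout, not part of the original answer):

```python
import numpy as np

numrows, numcols = 10, 3
x = np.random.random((numrows, numcols)) > 0.5

# View each length-3 bool row as a single 3-byte record so that
# np.unique compares whole rows instead of individual elements.
records = x.view(','.join(numcols * ['i1']))

unique_vals, indices = np.unique(records, return_inverse=True)
counts = np.bincount(indices)

# unique_vals is a 1-D structured array; view it back as plain bool rows.
unique_rows = unique_vals.view('i1').reshape(-1, numcols).astype(bool)

print(unique_rows)
print(counts)
```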

Of course, there's nothing really wrong with the way you were originally doing it... Just to show a slightly cleaner way to write your original function (using tuples, as Justin suggested):

def unique_rows(data):
    unique = dict()
    for row in data:
        row = tuple(row)
        if row in unique:
            unique[row] += 1
        else:
            unique[row] = 1
    return unique

We can take this one step further and use a defaultdict:

from collections import defaultdict
def unique_rows(data):
    unique = defaultdict(int)
    for row in data:
        unique[tuple(row)] += 1
    return unique

As it happens, either of these options appears to be faster than the "numpy-thonic" way of doing it (I would have guessed the opposite!). Converting the rows to strings as in your original example is slow, though; you definitely want to compare tuples instead of strings.
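For completeness: on Python 2.7+ the standard-library Counter collapses the defaultdict version above to essentially one line. A minimal sketch, using a small hand-made array rather than your data:

```python
from collections import Counter

import numpy as np

data = np.array([[True, False],
                 [True, False],
                 [False, True]])

# Count row tuples directly; the result maps each distinct row to its frequency.
counts = Counter(tuple(row) for row in data)

print(counts[(True, False)])  # frequency of the row [True, False] -> 2
```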

Joe Kington

I like this solution; it's helpful:

def unique_rows(data):
    unique = dict()
    for row in data:
        row = tuple(row)
        if row in unique:
            unique[row] += 1
        else:
            unique[row] = 1
    return unique

It is very fast. My only concern is: is it possible to do the same with unique as an array rather than a dict()? I'm having trouble printing the unique dictionary without the dictionary formatting. Thanks, Giuseppe
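If a new enough NumPy is available (np.unique grew return_counts in 1.9 and axis in 1.13, so this wouldn't have worked at the time of the question), the counting can indeed stay entirely in arrays, with no dict involved. A minimal sketch:

```python
import numpy as np

data = np.array([[True, False, True],
                 [True, False, True],
                 [False, True, False]])

# axis=0 makes np.unique compare whole rows; return_counts gives frequencies.
unique_rows, counts = np.unique(data, axis=0, return_counts=True)

for row, n in zip(unique_rows, counts):
    print(row, n)
```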

Sandip Armal Patil
  • How fast is this for an incredibly large array? It seems like you are copying everything rather than looking at it in-place (so I would guess it would be slow?) – Andy Hayden Sep 27 '12 at 12:37