Finding repeated rows in a numpy array

Question

The following function is designed to find the unique rows of an array:

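import numpy as np

def unique_rows(a):
    # View each row as one opaque (void) item so np.unique compares whole rows byte-for-byte
    a = np.ascontiguousarray(a)
    b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(b, return_index=True)
    unique_a = a[idx]
    return unique_a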

For example,

test = np.array([[1,0,1],[1,1,1],[1,0,1]])
unique_rows(test)
[[1,0,1],[1,1,1]]

I believe this function should always work, but it may not be watertight. In my code I want to count how many unique positions exist for a set of particles. The particles are stored in a 2D array, with each row holding the position of one particle. The positions are of type np.float64. I have also defined the following function:

def pos_tag(pos):
    x,y,z = pos[:,0],pos[:,1],pos[:,2]
    return (2**x)*(3**y)*(5**z)

In principle this function should produce a unique value for any (x,y,z) position.

However, when I use these two functions to calculate the number of unique positions in my set of particles, they produce different answers. Is this due to a logical flaw in the first function, or to the second function not producing a unique value for each position?

EDIT: Usage example

I have some long code that produces a 2D array of particle positions.

partpos.shape = (6039539,3)

I then calculate the number of unique rows as follows

len(unique_rows(partpos))
6034411

And

posids = pos_tag(partpos)
len(np.unique(posids))
5328871

```pos[:,0]``` *identifies* the first column, if you want the first row it would be ```pos[0,:]```. — wwii, Oct 20 '16 at 15:08
pos_tag will produce a 1d array, whose length is equal to the number of particles. pos[:,0] identifies the x coordinate of each particle, so that when the unique value is calculated, the operation is performed on each position at the same time — Jack, Oct 20 '16 at 15:11
numpy and python handle floats differently; I think that is the crux here. Try rounding down your floats a little to see if that makes a difference — Eelco Hoogendoorn, Oct 20 '16 at 15:14
Can you give an example case of when you don't get the expected results? — B. Eckles, Oct 20 '16 at 15:22
Can you also explain exactly how you are using the second function to determine the number of unique positions? — B. Eckles, Oct 20 '16 at 15:24
Why is `partpos.shape[1]` equal to 4? The example you gave above shows each row of length 3... — B. Eckles, Oct 20 '16 at 15:35
Sorry, the 4th value in each row is the mass of the particle, which is extraneous to this problem. — Jack, Oct 20 '16 at 15:37
Unless I'm missing something, isn't your `unique_rows` function including that mass in the decision about whether two rows are unique? Eg. [1,1,1,2] is not the same as [1,1,1,3], but they have the same position. — B. Eckles, Oct 20 '16 at 15:39
Sorry for the lack of clarity, I actually defined partpos to just be the positions of the particles, it has 3 columns, not 4. — Jack, Oct 20 '16 at 15:42
For, pos_tag, did you work from a proof developed beforehand or did you wing it using intuition? — wwii, Oct 20 '16 at 15:53
I took inspiration from this question when writing pos_tag http://math.stackexchange.com/questions/1176184/how-to-find-unique-numbers-from-3-numbers — Jack, Oct 20 '16 at 15:57
Note that the linked question is talking about integers, not floats. The probability of a collision is still low, but it's not the same as zero. — B. Eckles, Oct 20 '16 at 15:58
Author of http://math.stackexchange.com/a/1176241/380807 stated it is not designed to work with floats. — wwii, Oct 20 '16 at 16:18
I imagine you also picked some code from http://stackoverflow.com/q/16970982/2823755 q&a - did you try pandas DataFrame.drop_duplicates? — wwii, Oct 20 '16 at 17:43
```d = collections.Counter(map(str, test))``` or ```d = collections.Counter(str(thing) for thing in a)``` - might help but, it might be a bit slow. Or even ```len(np.unique(np.apply_along_axis(str, 1, a)))``` — wwii, Oct 20 '16 at 17:49
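The point raised in the comments, that the prime-power tag is only guaranteed unique for integers, can be sketched as follows. For real-valued coordinates the tag equals exp(x ln 2 + y ln 3 + z ln 5), so any two positions with the same value of x ln 2 + y ln 3 + z ln 5 get (up to rounding) the same tag, and very large coordinates all overflow to inf. The arrays `same_plane` and `far_away` below are made-up illustrations, not the OP's data.

```python
import numpy as np

def pos_tag(pos):
    x, y, z = pos[:, 0], pos[:, 1], pos[:, 2]
    return (2**x) * (3**y) * (5**z)

# Two different positions chosen so that x*ln2 + y*ln3 + z*ln5 is the same
# for both (hypothetical values, picked only to illustrate the collision).
same_plane = np.array([[1.0, 0.0, 0.0],
                       [0.0, np.log(2) / np.log(3), 0.0]])
tags_plane = pos_tag(same_plane)  # both tags are approximately 2

# Large coordinates overflow float64, so unrelated positions all map to inf.
far_away = np.array([[2000.0, 0.0, 0.0],
                     [0.0, 2000.0, 0.0]])
with np.errstate(over="ignore"):  # silence the expected overflow warning
    tags_far = pos_tag(far_away)

print(tags_plane)
print(tags_far)
```

Whether the first pair collides exactly or only to within rounding depends on the platform's pow implementation, which is why the sketch only claims the tags are approximately equal.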

score 1 · Answer 1 · answered Oct 20 '16 at 16:00

I believe the discrepancy is due to floating-point precision. Running

print(len(unique_rows(partpos.astype(np.float32))))
print(len(np.unique(pos_tag(partpos))))

prints

6034411
6034411

However, with

print(len(unique_rows(partpos.astype(np.float32))))
print(len(np.unique(pos_tag(partpos.astype(np.float32)))))

the output is

6034411
5328871
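The rounding effect can be reproduced on a small synthetic array. This is only a sketch under stated assumptions: the helper `consecutive_floats` and the starting value 0.1 are made up for illustration and have nothing to do with the OP's data. It builds 100 adjacent representable values, so all rows are distinct, and then counts how many distinct tags survive. Near 0.1 the spacing of the inputs is much finer than the spacing of the results of `2**x` (which lie just above 1), so many neighbouring positions have to share a tag.

```python
import numpy as np

def pos_tag(pos):
    x, y, z = pos[:, 0], pos[:, 1], pos[:, 2]
    return (2**x) * (3**y) * (5**z)

def consecutive_floats(start, n, dtype):
    """Return n adjacent representable values of `dtype`, beginning at `start`."""
    out = np.empty(n, dtype=dtype)
    out[0] = start
    for i in range(1, n):
        out[i] = np.nextafter(out[i - 1], dtype(np.inf))
    return out

counts = {}
for dtype in (np.float32, np.float64):
    pos = np.zeros((100, 3), dtype=dtype)
    pos[:, 0] = consecutive_floats(0.1, 100, dtype)  # only x varies
    n_rows = len(np.unique(pos, axis=0))    # distinct positions
    n_tags = len(np.unique(pos_tag(pos)))   # distinct tags
    counts[dtype.__name__] = (n_rows, n_tags)
    print(dtype.__name__, n_rows, n_tags)
```

In this sketch the tag count falls below the row count for both dtypes, which suggests the tag approach undercounts whenever positions are closely spaced, whichever precision is used.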

score -1 · Answer 2 · B. Eckles · 2016-10-20T15:14:49.117

a = [[1,0,1],[1,1,1],[1,0,1]]

# Convert rows to tuples so they're hashable, creating a generator thereof
b = (tuple(row) for row in a)

# Convert back to list of lists, after coercing to a set to eliminate non-unique rows
unique_rows = list(list(row) for row in set(b))

Edit: Well that's embarrassing. I just realized I didn't really address the question asked. This could still be the answer the OP is looking for, so I'll leave it, but it's not really what was asked. Sorry for that.

edited Oct 20 '16 at 15:14

answered Oct 20 '16 at 15:08

B. Eckles

Is this method of finding unique rows independent of the two methods I described in the question? If so it could be useful for testing. – Jack Oct 20 '16 at 15:19
Yes, it's a more directly Pythonic way of accomplishing the same thing. Only main issue is that it doesn't ensure the same order of rows every time you run it. You could enforce an order by sorting afterward, among other methods. – B. Eckles Oct 20 '16 at 15:22
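A minimal sketch of the cross-check discussed in these comments, using the small array from the answer. It assumes NumPy 1.13 or later for the `axis` argument of `np.unique`; the names `rows_via_set` and `rows_via_numpy` are made up for illustration. Sorting the set gives a repeatable row order, which addresses the ordering caveat.

```python
import numpy as np

a = np.array([[1, 0, 1], [1, 1, 1], [1, 0, 1]])

# Pure-Python route: rows become hashable tuples, set() drops duplicates,
# and sorted() makes the resulting order repeatable.
rows_via_set = sorted(set(map(tuple, a.tolist())))

# NumPy route: np.unique with axis=0 returns the unique rows in sorted order.
rows_via_numpy = np.unique(a, axis=0)

print(len(rows_via_set), len(rows_via_numpy))
```

If the two counts disagree on real data, that points at how rows are being compared rather than at the data itself.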
