How to test if all rows are distinct in numpy

Question

In numpy, is there a nice idiomatic way of testing if all rows are distinct in a 2d array?

I thought I could do

len(np.unique(arr)) == len(arr)

but this doesn't work at all, because np.unique flattens the array by default. For example,

arr = np.array([[1,2,3],[1,2,4]])
np.unique(arr)
Out[4]: array([1, 2, 3, 4])

Note: http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array is about FINDING the unique row, OP is about TESTING if the rows are all unique. Different questions. — CT Zhu, Oct 02 '14 at 17:10
Several interesting answers to how to drop nonunique rows/columns: http://mail.scipy.org/pipermail/scipy-user/2011-December/031193.html. You can then just see if the reduced array is the same as the original. If you use pandas, there is an efficient implementation to do such a thing. — wflynny, Oct 02 '14 at 17:15
Finding Unique rows would essentially be the same thing as seeing if each row is unique. — GWW, Oct 02 '14 at 19:00
@GWW I think the point is the answer in the linked question might be overkill for the testing problem. In other words there might be a simpler and faster solution to this problem. — Simd, Oct 03 '14 at 15:17

CT Zhu · Accepted Answer · 2014-10-02T19:14:49.580

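You can calculate the correlation matrix and check whether only the diagonal elements are 1. Note that this really tests for perfectly correlated rows rather than identical ones: a row that is a positive multiple or a shifted copy of another row is also flagged, and an exact ==1 comparison on floats can be affected by rounding.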

(np.corrcoef(M)==1).sum()==M.shape[0]


In [66]: M = np.random.random((5,8))

In [72]: (np.corrcoef(M)==1).sum()==M.shape[0]
Out[72]: True
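To see where the correlation test and a direct row comparison can disagree, here is a small sketch. The array below is made up for illustration and is not from the original answer; its second row is exactly twice the first, so the rows are distinct but perfectly correlated. np.isclose is used instead of ==1 because the computed coefficient may differ from 1 by rounding.

```python
import numpy as np

# Hypothetical example data: two distinct rows, the second is twice the first.
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

corr = np.corrcoef(M)

# The off-diagonal correlation is 1 up to rounding, although the rows differ.
rows_look_correlated = bool(np.isclose(corr[0, 1], 1.0))

# A direct comparison of the rows as tuples sees them as distinct.
rows_are_distinct = len(set(map(tuple, M))) == len(M)

print(rows_look_correlated, rows_are_distinct)
```

Both values come out True here, so a correlation-based check would report a duplicate where there is none.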

If you want to do the same thing for the columns:

(np.corrcoef(M, rowvar=0)==1).sum()==M.shape[1]

or without numpy at all:

len(set(map(tuple,M)))==len(M)
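As a minimal sketch, the set-based one-liner can be wrapped in a small helper. The function names below are my own, not part of numpy. The second helper is an alternative that assumes NumPy 1.13 or later, where np.unique accepts an axis argument; it is not what the timings in this answer measured.

```python
import numpy as np

def all_rows_distinct(arr):
    """Return True if no two rows of a 2-D array are equal element-wise."""
    arr = np.asarray(arr)
    # Each row becomes a hashable tuple; duplicate rows collapse in the set.
    return len(set(map(tuple, arr))) == len(arr)

def all_rows_distinct_np(arr):
    """Same test via np.unique along axis 0 (assumes NumPy >= 1.13)."""
    arr = np.asarray(arr)
    return len(np.unique(arr, axis=0)) == len(arr)

distinct = np.array([[1, 2, 3], [1, 2, 4]])
repeated = np.array([[1, 2, 3], [1, 2, 4], [1, 2, 3]])

print(all_rows_distinct(distinct), all_rows_distinct_np(distinct))
print(all_rows_distinct(repeated), all_rows_distinct_np(repeated))
```

The first line printed is True True (the array from the question), the second is False False.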

Filtering out the unique rows and then testing whether the result has the same shape as M is overkill:

In [99]: %%timeit
    ...: b = np.ascontiguousarray(M).view(np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
    ...: _, idx = np.unique(b, return_index=True)
    ...: unique_M = M[idx]
    ...: unique_M.shape==M.shape
10000 loops, best of 3: 54.6 µs per loop

In [100]: %timeit len(set(map(tuple,M)))==len(M)
10000 loops, best of 3: 24.9 µs per loop
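If you want to repeat the comparison outside IPython, something like the following should work with the standard timeit module. This is only a sketch: the helper names and the seed are arbitrary, and np.unique(a, axis=0) (NumPy 1.13 or later) stands in for the void-view trick above, so the numbers will not match the ones quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
M = rng.random((5, 8))          # same shape as the array timed above

def via_set(a):
    # Pure-Python route: rows -> tuples -> set.
    return len(set(map(tuple, a))) == len(a)

def via_unique(a):
    # NumPy route: keep the unique rows, then compare the row counts.
    return len(np.unique(a, axis=0)) == len(a)

n_calls = 1000
for fn in (via_set, via_unique):
    seconds = timeit.timeit(lambda: fn(M), number=n_calls)
    print(f"{fn.__name__}: {seconds / n_calls * 1e6:.1f} µs per call")
```

Which approach is faster is likely to depend on the array size, so it is worth timing with arrays shaped like your real data.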

edited Oct 02 '14 at 19:14

answered Oct 02 '14 at 17:04

CT Zhu


Thank you very much for this. It's surprising that a non-numpy way is the fastest. Doesn't it have to convert numpy array -> tuple -> set ? – Simd Oct 03 '14 at 08:07
Pure python FTW! If there are many more rows than cols, can try `len(set(tuple(zip(*M.T)))) == len(M)` it might be faster. – Kardo Paska May 05 '20 at 23:49
