In numpy, is there a nice idiomatic way of testing if all rows are distinct in a 2d array?

I thought I could do

len(np.unique(arr)) == len(arr)

but this doesn't work at all, because np.unique flattens the array and returns its unique elements rather than its unique rows. For example,

arr = np.array([[1,2,3],[1,2,4]])
np.unique(arr)
Out[4]: array([1, 2, 3, 4])
Simd
  • Note: http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array is about FINDING the unique row, OP is about TESTING if the rows are all unique. Different questions. – CT Zhu Oct 02 '14 at 17:10
  • Several interesting answers to how to drop nonunique rows/columns: http://mail.scipy.org/pipermail/scipy-user/2011-December/031193.html. You can then just see if the reduced array is the same as the original. If you use pandas, there is an efficient implementation to do such a thing. – wflynny Oct 02 '14 at 17:15
  • @GWW Isn't the question different as CT Zhu pointed out? – Simd Oct 02 '14 at 17:53
  • Finding Unique rows would essentially be the same thing as seeing if each row is unique. – GWW Oct 02 '14 at 19:00
  • @GWW I think the point is the answer in the linked question might be overkill for the testing problem. In other words there might be a simpler and faster solution to this problem. – Simd Oct 03 '14 at 15:17

1 Answer

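You can calculate the correlation matrix and ask if only the diagonal elements are 1. Note that this really tests for perfectly correlated rows, so proportional rows such as [1,2,3] and [2,4,6] are also flagged as duplicates, and floating-point rounding can affect the exact ==1 comparison: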

(np.corrcoef(M)==1).sum()==M.shape[0]


In [66]:

M = np.random.random((5,8))
In [72]:

(np.corrcoef(M)==1).sum()==M.shape[0]
Out[72]:
True
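
To see how the correlation test behaves on rows that are distinct but proportional, here is a small check. It is only a sketch: the array and the variable names are made up for illustration, and np.isclose is used instead of ==1 to sidestep rounding.

```python
import numpy as np

# Hypothetical example: two different rows where one is a multiple of the other.
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

corr = np.corrcoef(M)

# The off-diagonal correlation is (numerically) 1, so the correlation test
# would treat these rows as duplicates.
looks_duplicated = bool(np.isclose(corr[0, 1], 1.0))

# A direct comparison of the rows shows they are in fact distinct.
truly_distinct = len(set(map(tuple, M))) == len(M)

print(looks_duplicated, truly_distinct)
```

So the correlation approach is probably best kept for data where proportional or shifted rows cannot occur.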

To do the same thing for the columns:

(np.corrcoef(M, rowvar=0)==1).sum()==M.shape[1]

or without numpy at all:

len(set(map(tuple,M)))==len(M)
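
As a sketch, the set-of-tuples test can be wrapped in a small function and compared with np.unique called with axis=0, which returns unique rows (available from NumPy 1.13 on). The function names and sample arrays here are hypothetical, chosen only for illustration.

```python
import numpy as np

def rows_all_distinct_set(arr):
    """Turn each row into a hashable tuple and compare the number of
    distinct rows with the total number of rows."""
    return len(set(map(tuple, arr))) == len(arr)

def rows_all_distinct_unique(arr):
    """Same test using np.unique along axis 0 (requires NumPy >= 1.13)."""
    return len(np.unique(arr, axis=0)) == len(arr)

distinct = np.array([[1, 2, 3], [1, 2, 4]])   # the array from the question
repeated = np.array([[1, 2, 3], [1, 2, 3]])   # same row twice

print(rows_all_distinct_set(distinct), rows_all_distinct_unique(distinct))
print(rows_all_distinct_set(repeated), rows_all_distinct_unique(repeated))
```

Both functions should agree; which one is faster is likely to depend on the array's shape and dtype.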

Filtering out the unique rows and then testing whether the result has the same shape as M is overkill:

In [99]:

%%timeit

b = np.ascontiguousarray(M).view(np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_M = M[idx]

unique_M.shape==M.shape
10000 loops, best of 3: 54.6 µs per loop
In [100]:

%timeit len(set(map(tuple,M)))==len(M)
10000 loops, best of 3: 24.9 µs per loop
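
The comparison can be repeated outside IPython with the standard timeit module. This is a rough sketch, not a reproduction of the numbers above: it swaps the void-view trick for np.unique(axis=0) (NumPy 1.13 or later), the helper names are hypothetical, and the timings will vary with machine and array size.

```python
import timeit
import numpy as np

M = np.random.random((5, 8))
n_calls = 1000  # number of repetitions per measurement

def via_set(arr):
    # Pure-Python test: hash each row as a tuple.
    return len(set(map(tuple, arr))) == len(arr)

def via_unique(arr):
    # NumPy test: extract the unique rows and compare shapes.
    return np.unique(arr, axis=0).shape == arr.shape

t_set = timeit.timeit(lambda: via_set(M), number=n_calls)
t_unique = timeit.timeit(lambda: via_unique(M), number=n_calls)

print(f"set of tuples:     {t_set / n_calls * 1e6:.1f} µs per call")
print(f"np.unique(axis=0): {t_unique / n_calls * 1e6:.1f} µs per call")
```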
CT Zhu
  • Thank you very much for this. It's surprising that a non-numpy way is the fastest. Doesn't it have to convert numpy array -> tuple -> set ? – Simd Oct 03 '14 at 08:07
  • Pure python FTW! If there are many more rows than cols, can try `len(set(tuple(zip(*M.T)))) == len(M)` it might be faster. – Kardo Paska May 05 '20 at 23:49