Remove duplicate rows of a numpy array

Question

How can I remove duplicate rows of a two-dimensional numpy array?

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

The answer should be as follows:

ans = array([[1,8,3,3,4],
             [1,8,9,9,4]])

If there are two rows that are the same, then I would like to remove one "duplicate" row.

Is it okay if the rows are not in the order originally present in the input array? — Divakar, Jun 28 '15 at 07:39
My problem is very similar to yours. Look here: http://stackoverflow.com/questions/31093261/python-routine-to-extract-linear-independent-rows-from-a-rank-deficient-matrix/31093331?noredirect=1#comment50210205_31093331 — Simone Bolognini, Jun 28 '15 at 07:42
I believe you can now apply `np.unique` over an axis, so `np.unique(data, axis=0)` works. — Austin Garrett, Jan 30 '18 at 20:57

score 101 · Accepted Answer · edited Oct 01 '20 at 10:28

You can use numpy unique. Since you want the unique rows, we need to put them into tuples:

import numpy as np

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

Just applying np.unique to the data array, i.e. uniques = np.unique(data), flattens the array and results in this:

>>> uniques
array([1, 3, 4, 8, 9])

That is, it returns the unique elements of the whole array rather than the unique rows. So putting the rows into tuples first:

new_array = [tuple(row) for row in data]
uniques = np.unique(new_array)

which prints:

>>> uniques
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])

UPDATE

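In newer versions of NumPy (1.13 and later), np.unique accepts an axis argument, so you can get the unique rows directly with np.unique(data, axis=0); the tuple conversion is not needed.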
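
As a minimal sketch of the updated call (assuming NumPy 1.13 or later, where np.unique accepts axis), the following returns the unique rows in sorted order. The return_index step is an optional addition that is not part of the original answer; it is one way to keep the rows in the order they first appear:

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

# axis=0 treats each row as one item, so duplicates are whole rows.
# The result is sorted lexicographically by row.
unique_rows = np.unique(data, axis=0)

# Optional: keep rows in order of first appearance instead of sorted order.
# return_index gives the index of the first occurrence of each unique row.
_, first_idx = np.unique(data, axis=0, return_index=True)
unique_in_order = data[np.sort(first_idx)]

print(unique_rows)
print(unique_in_order)
```

For the sample data both results happen to coincide, because the first occurrences are already in sorted order.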

edited Oct 01 '20 at 10:28 by Belphegor
answered Jun 28 '15 at 07:44 by Srivatsan

I tried `new_array = [tuple(row) for row in data]; uniques = np.unique(new_array)` but `uniques` was still `array([1, 3, 4, 8, 9])` @ThePredator – Owen Jun 27 '16 at 03:33
@Owen: That cannot be possible, check your code once more. – Srivatsan Jun 27 '16 at 08:25
Here is the code; I used the same code as you showed: `import numpy as np; data = np.array([[1,8,3,3,4], [1,8,9,9,4], [1,8,3,3,4]]); new_array = [tuple(row) for row in data]; uniques = np.unique(new_array); uniques` gives `Out[30]: array([1, 3, 4, 8, 9])`. Does this have anything to do with the numpy version? My numpy version is 1.9.2 – Owen Jun 27 '16 at 14:18
@Owen: I have no idea! – Srivatsan Jun 27 '16 at 15:05
In numpy 1.9.0, the help states that it flattens the input, however omerbp's solution works. – wassname Jan 24 '17 at 01:30
A couple of years late, but it doesn't work for me. `y = [np.random.randint(0, 3, 2) for i in range(20)]; new_y = [tuple(element) for element in y]; set(new_y)` or, if you need it as a list: `list(set(new_y))` – JackS Mar 11 '17 at 18:54
I think the following is the right answer http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array – TommasoF Apr 12 '17 at 17:22
In the new version, you need to set `np.unique(data, axis=0)` – Justas May 28 '19 at 00:40
I think that this was for an outdated version. It is better to use what @Justas posted. – Gabriel Avendaño Sep 27 '20 at 17:13
Note that Divakar's `lexsort` solution is still the fastest presented here (at least for this example). – Marius Wallraff Jul 22 '22 at 14:18

Divakar · Answer 2 · 2015-06-28T08:27:48.653

One approach with lex-sorting -

# Perform lex sort and get sorted data
sorted_idx = np.lexsort(data.T)
sorted_data =  data[sorted_idx,:]

# Get unique row mask
row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))

# Get unique rows
out = sorted_data[row_mask]

Sample run -

In [199]: data
Out[199]: 
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 0, 3, 4],
       [1, 8, 9, 9, 4]])

In [200]: sorted_idx = np.lexsort(data.T)
     ...: sorted_data =  data[sorted_idx,:]
     ...: row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
     ...: out = sorted_data[row_mask]
     ...: 

In [201]: out
Out[201]: 
array([[1, 8, 0, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])
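
To make the individual steps easier to follow, here is an annotated sketch of the same lex-sort idea on the sample data above. The last two lines are an assumed extension that is not part of the original answer: they use the sort indices to return the unique rows in the order they first appear in the input, which relies on np.lexsort being a stable sort.

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 0, 3, 4],
                 [1, 8, 9, 9, 4]])

# np.lexsort is a stable sort that treats the LAST key as the primary one,
# so with data.T the last column is the primary sort key.
# After sorting, identical rows sit next to each other.
sorted_idx = np.lexsort(data.T)
sorted_data = data[sorted_idx]

# A row starts a new group when it differs from the previous sorted row.
changed = np.any(np.diff(sorted_data, axis=0) != 0, axis=1)

# The first sorted row is always kept.
row_mask = np.concatenate(([True], changed))

# Unique rows in lexsorted order, as in the answer.
out = sorted_data[row_mask]

# Assumed variant: map the kept rows back to their original positions
# and sort those positions to restore first-appearance order.
first_idx = np.sort(sorted_idx[row_mask])
out_in_order = data[first_idx]
```

Here out matches the sample run above, while out_in_order contains the same three rows in the order they first appear in data.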

Runtime tests -

This section times all approaches proposed in the solutions presented thus far.

In [34]: data = np.random.randint(0,10,(10000,10))

In [35]: def tuple_based(data):
    ...:     new_array = [tuple(row) for row in data]
    ...:     return np.unique(new_array)
    ...: 
    ...: def lexsort_based(data):                 
    ...:     sorted_data =  data[np.lexsort(data.T),:]
    ...:     row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
    ...:     return sorted_data[row_mask]
    ...: 
    ...: def unique_based(a):
    ...:     a = np.ascontiguousarray(a)
    ...:     unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    ...:     return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
    ...: 

In [36]: %timeit tuple_based(data)
10 loops, best of 3: 63.1 ms per loop

In [37]: %timeit lexsort_based(data)
100 loops, best of 3: 8.92 ms per loop

In [38]: %timeit unique_based(data)
10 loops, best of 3: 29.1 ms per loop

f.y.i.: `unique_based` is about twice as fast as `np.unique(data, axis=0)`, so lexsort is still preferable in 2022. — Marius Wallraff, Jul 22 '22 at 14:17
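
Since timings depend heavily on the NumPy version and the machine, here is a small script sketch for re-running the comparison outside IPython, with np.unique(data, axis=0) included. It assumes NumPy 1.17 or later for np.random.default_rng; the axis_based name, the seed and the repeat counts are arbitrary choices made for this illustration, and no timings are quoted because they will differ per setup.

```python
import timeit

import numpy as np


def lexsort_based(data):
    # Sort rows lexicographically, then keep rows that differ from the previous one.
    sorted_data = data[np.lexsort(data.T), :]
    row_mask = np.append([True], np.any(np.diff(sorted_data, axis=0), 1))
    return sorted_data[row_mask]


def unique_based(a):
    # View each row as one structured element so np.unique compares whole rows.
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))


def axis_based(data):
    # Built-in row-wise unique (hypothetical wrapper name, for timing only).
    return np.unique(data, axis=0)


rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(10000, 10))

for fn in (lexsort_based, unique_based, axis_based):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(data), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Note that lexsort_based returns the rows in a different order than the other two (it sorts by the last column first), so compare the results as sets of rows when checking that the approaches agree.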

score 8 · Answer 3 · edited May 23 '17 at 12:32

A simple solution can be:

import numpy as np
def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

print(unique_rows(data))
# prints:
# [[1 8 3 3 4]
#  [1 8 9 9 4]]
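
For readers wondering why the view trick works, here is a step-by-step sketch of what unique_rows does internally. The intermediate variable names are made up for illustration; the calls are the same ones used in the function above.

```python
import numpy as np

a = np.ascontiguousarray(np.array([[1, 8, 3, 3, 4],
                                   [1, 8, 9, 9, 4],
                                   [1, 8, 3, 3, 4]]))

# One unnamed field per column; NumPy auto-names them f0, f1, ...
row_dtype = [('', a.dtype)] * a.shape[1]

# Viewing with this dtype packs each row into a single structured element,
# so the shape goes from (3, 5) to (3, 1). No data is copied.
rows_as_records = a.view(row_dtype)

# np.unique flattens its input, but each element is now a whole row,
# so duplicates are detected row by row.
unique_records = np.unique(rows_as_records)

# View the records as plain numbers again and restore the column count.
unique_rows_result = unique_records.view(a.dtype).reshape(-1, a.shape[1])

print(rows_as_records.shape)
print(unique_rows_result)
```

The np.ascontiguousarray call matters because this kind of view needs each row to be laid out contiguously in memory.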

You can check this for many more solutions for this problem
