13

I have a (N,3) array of numpy values:

>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 7],
       [0, 4, 5],
       [2, 2, 1],
       [0, 0, 0],
       [5, 4, 3]])

I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:

>>> duplicates_removed
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])

I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?

jterrace
  • By "without looping" what do you mean? You've got to check every item in the array, so it's O(m*n) no matter what tricks you use to hide the loop. – agf Sep 15 '11 at 23:14
  • I think he means looping in Numpy rather than looping in Python. O(m*n) inside a compiled Numpy function is much faster than O(m*n) in a Python `for` loop. When the options are compiled code and interpreted code, constants matter. – Jim Pivarski Jun 18 '14 at 16:17
  • [`From your comments`](https://stackoverflow.com/questions/7438438/removing-duplicates-in-each-row-of-a-numpy-array/45136720#comment8994361_7438505), since you were looking to generalize this to handle a generic number of columns, you might find [`this solution`](https://stackoverflow.com/a/45136720/) to this question worth a read. – Divakar Jul 17 '17 at 05:47

5 Answers

11

This is an option:

import numpy

vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# Mask of rows in which any two of the three columns are equal
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
# Drop the flagged rows
vals = numpy.delete(vals, numpy.where(a), axis=0)
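
Running this on the sample data leaves exactly the rows without a repeated value:

>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])
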
Benjamin
  • I was trying to work this out, good job. But don't you need | not ^ ? – Ned Batchelder Sep 15 '11 at 23:23
  • This is much faster than the list comprehension methods, so I'll probably accept. Wondering if there is any way to generalize to NxM though? – jterrace Sep 15 '11 at 23:27
  • @Ned Batchelder: yes, although it doesn't change anything in this case. – Benjamin Sep 15 '11 at 23:32
  • @jterrace You could generalize by generating the combinations of 0-m, using them in a generator expression to make the comparisons, then reducing by `|` to get `a` (see the sketch below). – agf Sep 16 '11 at 06:08
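
A minimal sketch of that generalization (my illustration, not code from the thread; the function name is made up, and it assumes at least two columns): build every pairwise column comparison with `itertools.combinations` and fold the masks together with `functools.reduce`:

import itertools
import functools
import numpy as np

def rows_without_dups(a):
    # OR together an equality test for every pair of columns
    dup = functools.reduce(
        np.logical_or,
        (a[:, i] == a[:, j] for i, j in itertools.combinations(range(a.shape[1]), 2))
    )
    return a[~dup]

On the (N,3) case this reproduces the mask `a` from the answer above.
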
3

Here's an approach that handles a generic number of columns and is still a vectorized method -

import numpy as np

def rows_uniq_elems(a):
    # Sort each row so that any duplicates become adjacent
    a_sorted = np.sort(a, axis=-1)
    # Keep rows whose consecutive sorted elements all differ
    return a[(a_sorted[...,1:] != a_sorted[...,:-1]).all(-1)]

Steps :

  • Sort along each row.

  • Compare consecutive elements in each sorted row; an equal adjacent pair (a zero difference) means that row contains a duplicate element. That comparison yields a mask of valid rows, so the final step is simply to select the valid rows off the input array using the mask (the sketch below shows the intermediates).
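
For instance (my illustration, not part of the original answer), here are the intermediates for two of the question's rows:

import numpy as np

a = np.array([[7, 8, 7],
              [0, 4, 5]])
a_sorted = np.sort(a, axis=-1)                            # [[7 7 8], [0 4 5]]
mask = (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)  # [False  True]
print(a[mask])                                            # [[0 4 5]]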

Sample run -

In [49]: a
Out[49]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [7, 8, 7, 8],
       [0, 4, 5, 6],
       [2, 2, 1, 1],
       [0, 0, 0, 3],
       [5, 4, 3, 2]])

In [50]: rows_uniq_elems(a)
Out[50]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [0, 4, 5, 6],
       [5, 4, 3, 2]])
Divakar
  • Out of interest, is `np.sort(a)` equivalent to `a[np.arange(idx.shape[0])[:,None], idx]`? – Josmoor98 May 23 '18 at 17:58
  • @EBB Not sure why I was going that indirect way. Updated with that sorting. Thanks for the suggestion! – Divakar May 25 '18 at 08:02
  • Great, thanks! I was actually reading your answer again just as you updated it! Freaky! Is `...` the same as `:` in your slicing operation? I haven't seen this implementation before. I'm also curious whether there is a difference between using `axis = -1` and `axis = 1`; for my problem, both return the same answer. Is there a specific reason for choosing `axis = -1` in your solution? Thanks for your help! – Josmoor98 May 25 '18 at 08:36
  • @EBB That's just a bit more generic, as it handles arrays of any dimension when removing rows. So any 2D, 3D, etc. array would work now. – Divakar May 25 '18 at 08:38
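
To illustrate that point (my example, not from the thread): `axis=-1` always means the last axis and `...` stands for all leading axes, so the `rows_uniq_elems` above also works on a 3D stack of matrices:

import numpy as np

a3 = np.array([[[1, 2, 3], [2, 2, 1]],
               [[4, 5, 6], [0, 0, 0]]])

# Rows containing duplicates are dropped across both leading axes
print(rows_uniq_elems(a3))
# [[1 2 3]
#  [4 5 6]]
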
2

It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.

import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

def rows_uniq_elems1(a):
    # Divakar: sort each row, then keep rows with no equal adjacent elements
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    # Benjamin: compare each pair of the three columns
    # (the mask gets its own name so rows are deleted from the original array)
    mask = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(mask), axis=0)

def rows_uniq_elems3(a):
    # Marcelo Cantos: list comprehension with set()
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    # Curtis Patrick: list comprehension with np.unique()
    return np.array([v for v in a if len(np.unique(v)) == len(v)])

Results:

%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop

%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop

%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop

%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop

It seems that using `set` beats `numpy.unique`. In my case I needed to do this over a much larger array:

bigvals = np.random.randint(0,10,3000).reshape([3,1000])

%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop

%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop

%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop

%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop

The methods without list comprehensions are much faster. However, the number of columns is hard-coded and difficult to extend to more than three, so in my case at least the list comprehension with `set` is the best answer.

EDITED because I confused rows and columns in bigvals

tellis
2

numpy.array([v for v in vals if len(set(v)) == len(v)])

Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.

Marcelo Cantos
  • I came up with `[item for item in vals if Counter(item).most_common(1)[0][1] == 1]` but that's nicer, especially since you already know `len(v)`. You're still "looping" in that you're iterating over the array, however. – agf Sep 15 '11 at 23:13
  • This is actually surprisingly fast for a large array though, although I need the index locations of the duplicates, so I like @Benjamin's solution (a sketch for getting those indices follows below). – jterrace Sep 15 '11 at 23:15
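
A minimal sketch of one way to get those index locations (my addition, not from the thread), using the sort-based mask:

import numpy as np

vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# Rows whose sorted values contain an equal adjacent pair have a duplicate
s = np.sort(vals, axis=1)
dup_rows = np.where((s[:, 1:] == s[:, :-1]).any(axis=1))[0]
print(dup_rows)  # [2 4 5]
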
1

Identical to Marcelo's, but I think using `numpy.unique()` instead of `set()` may get across exactly what you are shooting for.

numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])
Curtis Patrick