13

I have a (N,3) array of numpy values:

>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 7],
       [0, 4, 5],
       [2, 2, 1],
       [0, 0, 0],
       [5, 4, 3]])

I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:

>>> duplicates_removed
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])

I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?

jterrace
  • By "without looping" what do you mean? You've got to check every item in the array, so it's O(m*n) no matter what tricks you use to hide the loop. – agf Sep 15 '11 at 23:14
  • I think he means looping in Numpy rather than looping in Python. O(m*n) inside a compiled Numpy function is much faster than O(m*n) in a Python `for` loop. When the options are compiled code and interpreted code, constants matter. – Jim Pivarski Jun 18 '14 at 16:17
  • [`From your comments`](https://stackoverflow.com/questions/7438438/removing-duplicates-in-each-row-of-a-numpy-array/45136720#comment8994361_7438505), since you were looking to generalize this to handle a generic number of columns, you might find [`this solution`](https://stackoverflow.com/a/45136720/) to this question worth a read. – Divakar Jul 17 '17 at 05:47

5 Answers

11

This is an option:

import numpy

vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# Mask of rows in which any two of the three columns are equal
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
# Drop the flagged rows
vals = numpy.delete(vals, numpy.where(a), axis=0)
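
Running this on the sample data leaves exactly the rows without a repeated value:

>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])
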
Benjamin
  • I was trying to work this out, good job. But don't you need | not ^ ? – Ned Batchelder Sep 15 '11 at 23:23
  • This is much faster than the list comprehension methods, so I'll probably accept. Wondering if there is any way to generalize to NxM though? – jterrace Sep 15 '11 at 23:27
  • @Ned Batchelder: yes, although it doesn't change anything in this case. – Benjamin Sep 15 '11 at 23:32
  • @jterrace You could generalize by generating the combinations of 0-m, using them in a generator expression to make the comparisons, then reducing by `|` to get `a` (see the sketch below). – agf Sep 16 '11 at 06:08
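
A minimal sketch of that generalization (my illustration, not code from the thread; the function name is made up, and it assumes at least two columns): build every pairwise column comparison with `itertools.combinations` and fold the masks together with `functools.reduce`:

import itertools
import functools
import numpy as np

def rows_without_dups(a):
    # OR together an equality test for every pair of columns
    dup = functools.reduce(
        np.logical_or,
        (a[:, i] == a[:, j] for i, j in itertools.combinations(range(a.shape[1]), 2))
    )
    return a[~dup]

On the (N,3) case this reproduces the mask `a` from the answer above.
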
3

Here's an approach that handles a generic number of columns and is still a vectorized method -

import numpy as np

def rows_uniq_elems(a):
    # Sort each row so that any duplicates become adjacent
    a_sorted = np.sort(a, axis=-1)
    # Keep rows whose consecutive sorted elements all differ
    return a[(a_sorted[...,1:] != a_sorted[...,:-1]).all(-1)]

Steps :

  • Sort along each row.

  • Compare consecutive elements in each sorted row; an equal adjacent pair (a zero difference) means that row contains a duplicate element. That comparison yields a mask of valid rows, so the final step is simply to select the valid rows off the input array using the mask (the sketch below shows the intermediates).
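
For instance (my illustration, not part of the original answer), here are the intermediates for two of the question's rows:

import numpy as np

a = np.array([[7, 8, 7],
              [0, 4, 5]])
a_sorted = np.sort(a, axis=-1)                            # [[7 7 8], [0 4 5]]
mask = (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)  # [False  True]
print(a[mask])                                            # [[0 4 5]]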

Sample run -

In [49]: a
Out[49]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [7, 8, 7, 8],
       [0, 4, 5, 6],
       [2, 2, 1, 1],
       [0, 0, 0, 3],
       [5, 4, 3, 2]])

In [50]: rows_uniq_elems(a)
Out[50]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [0, 4, 5, 6],
       [5, 4, 3, 2]])
Divakar
  • Out of interest, is `np.sort(a)` equivalent to `a[np.arange(idx.shape[0])[:,None], idx]`? – Josmoor98 May 23 '18 at 17:58
  • @EBB Not sure why I was going that indirect way. Updated with that sorting. Thanks for the suggestion! – Divakar May 25 '18 at 08:02
  • Great, thanks! I was actually reading your answer again just as you updated it! Freaky! Is `...` the same as `:` in your slicing operation? I haven't seen this implementation before. I'm also curious whether there is a difference between using `axis = -1` and `axis = 1`; for my problem, both return the same answer. Is there a specific reason for choosing `axis = -1` in your solution? Thanks for your help! – Josmoor98 May 25 '18 at 08:36
  • @EBB That's just a bit more generic, as it handles arrays of any dimension when removing rows. So any 2D, 3D, etc. array would work now. – Divakar May 25 '18 at 08:38
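
To illustrate that point (my example, not from the thread): `axis=-1` always means the last axis and `...` stands for all leading axes, so the `rows_uniq_elems` above also works on a 3D stack of matrices:

import numpy as np

a3 = np.array([[[1, 2, 3], [2, 2, 1]],
               [[4, 5, 6], [0, 0, 0]]])

# Rows containing duplicates are dropped across both leading axes
print(rows_uniq_elems(a3))
# [[1 2 3]
#  [4 5 6]]
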
2

It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.

import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

def rows_uniq_elems1(a):
    # Divakar: sort each row, then keep rows with no equal adjacent elements
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    # Benjamin: compare each pair of the three columns
    # (the mask gets its own name so rows are deleted from the original array)
    mask = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(mask), axis=0)

def rows_uniq_elems3(a):
    # Marcelo Cantos: list comprehension with set()
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    # Curtis Patrick: list comprehension with np.unique()
    return np.array([v for v in a if len(np.unique(v)) == len(v)])

Results:

%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop

%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop

%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop

%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop

It seems that using `set` beats `numpy.unique`. In my case I needed to do this over a much larger array:

bigvals = np.random.randint(0,10,3000).reshape([3,1000])

%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop

%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop

%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop

%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop

The methods without list comprehensions are much faster. However, the number of columns is hard-coded and difficult to extend to more than three, so in my case at least the list comprehension with `set` is the best answer.

EDITED because I confused rows and columns in bigvals

tellis
2

numpy.array([v for v in vals if len(set(v)) == len(v)])

Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.

Marcelo Cantos
  • I came up with `[item for item in vals if Counter(item).most_common(1)[0][1] == 1]` but that's nicer, especially since you already know `len(v)`. You're still "looping" in that you're iterating over the array, however. – agf Sep 15 '11 at 23:13
  • This is actually surprisingly fast for a large array though, although I need the index locations of the duplicates, so I like @Benjamin's solution (a sketch for getting those indices follows below). – jterrace Sep 15 '11 at 23:15
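
A minimal sketch of one way to get those index locations (my addition, not from the thread), using the sort-based mask:

import numpy as np

vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# Rows whose sorted values contain an equal adjacent pair have a duplicate
s = np.sort(vals, axis=1)
dup_rows = np.where((s[:, 1:] == s[:, :-1]).any(axis=1))[0]
print(dup_rows)  # [2 4 5]
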
1

Identical to Marcelo's, but I think using `numpy.unique()` instead of `set()` may get across exactly what you are shooting for.

numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])
Curtis Patrick