Removing duplicate columns and rows from a NumPy 2D array

Question

I'm using a 2D array to store pairs of longitudes and latitudes. At one point I have to merge two of these 2D arrays and then remove any duplicated entries. I've been searching for a function similar to numpy.unique that works on whole rows, but I've had no luck. Every implementation I can think of looks very unoptimized. For example, I'm currently converting the array to a list of tuples, removing duplicates with set, and then converting back to an array:

coordskeys = np.array(list(set([tuple(x) for x in coordskeys])))

Are there any existing solutions, so I do not reinvent the wheel?

To make it clear, I'm looking for:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1], [2, 3], [5, 4]])

BTW, I wanted to use just a list of tuples for this, but the lists were so big that they consumed my 4 GB of RAM plus 4 GB of swap (NumPy arrays are more memory efficient).

See http://stackoverflow.com/questions/7989722/finding-unique-points-in-numpy-array — joris, Dec 19 '11 at 13:53

user545424 · Answer 1 · 2013-09-04T22:09:20.647

32

This should do the trick:

def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

Example:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
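For readers on NumPy 1.13 or later, np.unique itself accepts an axis argument, which covers the same use case without the structured view. A minimal sketch, assuming such a NumPy version is installed; the name unique_rows_builtin is only illustrative and is not part of the answer:

```python
import numpy as np

def unique_rows_builtin(a):
    # Hypothetical helper name; relies on the axis argument of np.unique,
    # which is available from NumPy 1.13 onwards.
    return np.unique(a, axis=0)

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_builtin(a))
```

As with the helper above, the rows come back sorted rather than in their original order.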

edited Sep 04 '13 at 22:09

answered Dec 19 '11 at 21:41


@user100464, edited so that it will work with transposed arrays. – user545424 Sep 04 '13 at 22:09

Bi Rico · Accepted Answer · 2011-12-20T03:25:28.933

16

Here's one idea. It'll take a little bit of work but could be quite fast. I'll give you the 1d case and let you figure out how to extend it to 2d. The following function finds the unique elements of a 1d array:

import numpy as np
def unique(a):
    a = np.sort(a)
    b = np.diff(a)
    b = np.r_[1, b]
    return a[b != 0]

Now to extend it to 2d you need to change two things. First, you will need to figure out how to do the sort yourself; the important thing about the sort is that two identical entries end up next to each other. Second, you'll need to do something like (b != 0).any(axis), because a row/column counts as new as soon as any of its elements differs from the previous one. Let me know if that's enough to get you started.
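A small check of the row comparison on rows that are already sorted may help here. Reducing with any flags a row as new when at least one column changes, whereas all would require every column to change at once. This is only an illustrative sketch; the array and variable names are made up for the example:

```python
import numpy as np

# Rows already sorted so that duplicates are adjacent.
rows = np.array([[1, 1], [1, 1], [1, 2], [2, 2]])
diff = np.diff(rows, axis=0)

changed_any = (diff != 0).any(axis=1)  # True if at least one column differs
changed_all = (diff != 0).all(axis=1)  # True only if every column differs

# The first row has no predecessor, so it is always kept.
keep = np.concatenate(([True], changed_any))
print(rows[keep])
```

With these rows, the all-based mask would miss [1, 2] and [2, 2], because each of them differs from its predecessor in only one column.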

Updated: with some help from doug, I think this should work for the 2d case.

import numpy as np
def unique(a):
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.diff(a, axis=0)
    ui = np.ones(len(a), 'bool')
    ui[1:] = (diff != 0).any(axis=1) 
    return a[ui]

edited Dec 20 '11 at 03:25

answered Dec 19 '11 at 16:37


+1. I just posted my answer and then read yours. It looks like mine is a faithful 2D implementation of yours, with the same sequence of identical functions (I even had a row concatenation step at first, but I removed it and sliced the first row off the original array instead). – doug Dec 19 '11 at 22:26
This answer mostly uses numpy, so Python 2 vs. 3 shouldn't matter. If it's not working for you, there is probably something else going on. – Bi Rico Mar 04 '16 at 21:00
Worked for me in Python3. Note that this doesn't preserve the order. – Ghostkeeper May 24 '16 at 13:44
Note that the lexsort solution is limited in how many columns it supports – Eelco Hoogendoorn Sep 07 '16 at 10:07
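If the original order does matter, one possible variant is to ask np.unique for the index of each first occurrence and then sort those indices. This is a sketch under the assumption that NumPy 1.13 or later is available (for the axis argument); unique_rows_keep_order is a hypothetical name, not something from the answers:

```python
import numpy as np

def unique_rows_keep_order(a):
    # Hypothetical helper: keep the first occurrence of each row,
    # in the order in which the rows appear in the input.
    _, first_idx = np.unique(a, axis=0, return_index=True)
    return a[np.sort(first_idx)]

a = np.array([[5, 4], [1, 1], [2, 3], [1, 1], [5, 4]])
print(unique_rows_keep_order(a))
```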

score 5 · Answer 3 · answered Nov 28 '13 at 16:36

My method turns the 2d array into a 1d complex array, where the real part is the 1st column and the imaginary part is the 2nd column, and then uses np.unique. This only works with exactly 2 columns, though.

import numpy as np 
def unique2d(a):
    x, y = a.T
    b = x + y*1.0j 
    idx = np.unique(b,return_index=True)[1]
    return a[idx]

Example -

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique2d(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
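One caveat with the complex-number trick is that the columns are converted to floating point, so integers above 2**53 can collapse onto the same value. The sketch below copies unique2d from this answer and compares it with np.unique(..., axis=0) (assumed available, NumPy 1.13 or later); the array big is made up for illustration:

```python
import numpy as np

def unique2d(a):
    # Copied from the answer above.
    x, y = a.T
    b = x + y * 1.0j
    idx = np.unique(b, return_index=True)[1]
    return a[idx]

# Two distinct rows whose first column differs only beyond float64 precision.
big = np.array([[2**53, 0], [2**53 + 1, 0]], dtype=np.int64)

n_complex = len(unique2d(big))           # rows after the complex conversion
n_exact = len(np.unique(big, axis=0))    # rows compared as exact integers
print(n_complex, n_exact)
```

Here the complex version reports one unique row while the exact integer comparison reports two. For coordinates stored as ordinary floats this is not an issue.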

Eelco Hoogendoorn · Answer 4 · 2016-04-02T20:41:51.007

3

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice and tested interface, plus many related features:

import numpy_indexed as npi
npi.unique(coordskeys)

edited Apr 02 '16 at 20:41

answered Apr 02 '16 at 14:46


doug · Answer 5 · 2011-12-20T03:00:31.260

>>> import numpy as NP
>>> # create a 2D NumPy array with some duplicate rows
>>> A = NP.array([[1, 1, 1, 5, 7],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8],
...               [5, 4, 5, 4, 7],
...               [1, 1, 1, 5, 7],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8]])

>>> # first, sort the rows with lexsort so duplicates end up contiguous
>>> # and each row stays intact
>>> a, b, c, d, e = A.T    # one sort key per column, to pass to lexsort
>>> ndx = NP.lexsort((a, b, c, d, e))
>>> ndx
    array([1, 3, 5, 7, 0, 4, 2, 6, 8])
>>> A = A[ndx,]

>>> # now diff by row; an all-zero row means "same as the row above"
>>> A1 = NP.diff(A, axis=0)
>>> A1
    array([[ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [-4, -3, -4,  1,  0],
           [ 0,  0,  0,  0,  0],
           [ 6,  8,  3,  2,  1],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0]])

>>> # boolean mask: True where a row differs from the row before it
>>> ndx = NP.any(A1, axis=1)
>>> ndx
    array([False, False, False,  True, False,  True, False, False])

>>> # retrieve the unique rows: the first row, plus every row that
>>> # differs from its predecessor
>>> A[NP.concatenate(([True], ndx))]
    array([[5, 4, 5, 4, 7],
           [1, 1, 1, 5, 7],
           [7, 9, 4, 7, 8]])

Doug, I think you're close, but you're going to run into trouble because NP.sort(A, axis=0) sorts each column independently. Try running your method on the following two arrays: `[[0, 0], [1, 1], [2, 2]]` and `[[0, 1], [1, 0], [2, 2]]`. I added a sort function to my answer that keeps the rows intact while sorting. – Bi Rico, Dec 20 '11 at 01:44
I didn't know about lexsort. I'm going to include it in my answer, if that's OK. – Bi Rico, Dec 20 '11 at 03:09
@Bago: absolutely. You were the first to solve the heart of the problem anyway, which is why I up-voted your answer and left a comment to let people know that my answer is just a modified version of yours posted several hours later. – doug, Dec 20 '11 at 03:43
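The difference between the two sorts mentioned in these comments can be seen on the second test array. This is just an illustrative sketch; the variable names are made up:

```python
import numpy as np

A = np.array([[0, 1], [1, 0], [2, 2]])

# np.sort with axis=0 sorts every column on its own, so rows get mixed up.
column_sorted = np.sort(A, axis=0)

# np.lexsort returns a row order, so indexing with it moves whole rows.
row_sorted = A[np.lexsort(A.T)]

print(column_sorted)
print(row_sorted)
```

The column-wise sort turns the input into [[0, 0], [1, 1], [2, 2]], which contains rows that never existed in A, while the lexsort version only reorders the existing rows.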

score 1 · Answer 6 · answered Dec 19 '11 at 13:54

1

Since you refer to numpy.unique, you don't care about maintaining the original order, correct? Converting to a set, which removes duplicates, and then back to a list is a commonly used idiom:

>>> x = [(1, 1), (2, 3), (1, 1), (5, 4), (2, 3)]
>>> y = list(set(x))
>>> y
[(5, 4), (2, 3), (1, 1)]
>>>

answered Dec 19 '11 at 13:54

yosukesabai


Yes, the order is not important. The solution of combining list + set is the one I use as an example in the OP (which I admit is quite obfuscated). The problem with it is that it uses lists, so the memory used is huge, which is the same problem I'd have if I had worked with lists instead of arrays from the beginning. – Sergi Dec 19 '11 at 15:46
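The memory concern raised here can be made concrete with a rough comparison. The sketch below assumes CPython and uses sys.getsizeof, which counts only the list and tuple objects themselves, so the figure for the list is a lower bound. The size n and the variable names are arbitrary choices for the example:

```python
import sys
import numpy as np

n = 100_000
arr = np.arange(2 * n, dtype=np.float64).reshape(n, 2)
as_tuples = [tuple(row) for row in arr.tolist()]

# The array stores two 8-byte floats per row and nothing else per element.
array_bytes = arr.nbytes

# Lower bound for the list: the list object plus one tuple object per row.
# The float objects referenced by the tuples are not counted here.
list_bytes = sys.getsizeof(as_tuples) + sum(sys.getsizeof(t) for t in as_tuples)

print(array_bytes, list_bytes)
```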
