22

I'm using a 2D array to store pairs of longitudes and latitudes. At one point I have to merge two of these 2D arrays and then remove any duplicated entries. I've been searching for a function similar to numpy.unique that works on whole rows, but I've had no luck. Every implementation I've come up with looks very unoptimized. For example, I tried converting the array to a list of tuples, removing duplicates with set, and then converting back to an array:

coordskeys = np.array(list(set([tuple(x) for x in coordskeys])))

Are there any existing solutions, so I do not reinvent the wheel?

To make it clear, I'm looking for:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1], [2, 3],[5, 4]])

BTW, I wanted to use just a list of tuples for it, but the lists were so big that they consumed my 4 GB of RAM plus 4 GB of swap (numpy arrays are more memory efficient).

doug
Sergi

6 Answers

32

This should do the trick:

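import numpy as np

def unique_rows(a):
    a = np.ascontiguousarray(a)
    # View each row as one structured element so np.unique compares whole rows.
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    # Convert the structured result back to a plain 2D array.
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))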

Example:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
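
A side note for newer setups: NumPy 1.13 and later accept an `axis` argument in `np.unique`, which removes duplicate rows directly. A minimal sketch, assuming such a version is installed; `first_batch`, `second_batch` and `unique_merged` are made-up names for the merge step described in the question:

```python
import numpy as np

# Two hypothetical batches of (longitude, latitude) pairs to merge.
first_batch = np.array([[1, 1], [2, 3], [1, 1]])
second_batch = np.array([[5, 4], [2, 3]])

# Stack the batches row-wise, then drop duplicate rows.
merged = np.vstack((first_batch, second_batch))
unique_merged = np.unique(merged, axis=0)  # requires NumPy >= 1.13

print(unique_merged)
```

The rows come back sorted lexicographically, not in their original order.
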
user545424
16

Here's one idea. It'll take a little bit of work but could be quite fast. I'll give you the 1D case and let you figure out how to extend it to 2D. The following function finds the unique elements of a 1D array:

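import numpy as np

def unique(a):
    a = np.sort(a)      # equal values end up next to each other
    b = np.diff(a)      # zero wherever an element repeats its predecessor
    b = np.r_[1, b]     # prepend a nonzero so the first element is always kept
    return a[b != 0]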

To extend it to 2D you need to change two things. First, you will need to figure out how to do the sort yourself; the important thing about the sort is that identical entries end up next to each other. Second, you'll need to do something like (b != 0).any(axis), because a row/column counts as new as soon as any of its entries differs from the previous one. Let me know if that's enough to get you started.

Update: With some help from doug, I think this should work for the 2D case.

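import numpy as np

def unique(a):
    order = np.lexsort(a.T)         # row order that puts identical rows together
    a = a[order]
    diff = np.diff(a, axis=0)       # all-zero wherever a row repeats the previous one
    ui = np.ones(len(a), 'bool')    # the first row is always kept
    ui[1:] = (diff != 0).any(axis=1)
    return a[ui]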
Bi Rico
  • +1 just posted my answer, then read yours. It looks like mine is a faithful 2D implementation of yours: the same sequence of identical functions (I even had a row concatenation step at first, but I removed it and sliced the first row off the original array instead). – doug Dec 19 '11 at 22:26
  • This answer mostly uses numpy, so Python 2 vs. 3 shouldn't matter. If it's not working for you, there is probably something else going on. – Bi Rico Mar 04 '16 at 21:00
  • Worked for me in Python3. Note that this doesn't preserve the order. – Ghostkeeper May 24 '16 at 13:44
  • Note that the lexsort solution is limited in how many columns it supports – Eelco Hoogendoorn Sep 07 '16 at 10:07
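
If the original order does matter, one possible way to keep it is sketched below. Note that this swaps in `np.unique` with `axis=0` and `return_index=True` (assuming NumPy 1.13 or later) instead of the lexsort approach, and `unique_rows_keep_order` is a hypothetical helper name, not something from the answers here:

```python
import numpy as np

def unique_rows_keep_order(a):
    """Hypothetical helper: unique rows in order of first appearance."""
    a = np.asarray(a)
    # return_index gives, for each unique row, the index of its first occurrence.
    _, first_idx = np.unique(a, axis=0, return_index=True)
    # Sorting those indices restores the original relative order of the rows.
    return a[np.sort(first_idx)]

a = np.array([[5, 4], [2, 3], [5, 4], [1, 1], [2, 3]])
ordered = unique_rows_keep_order(a)
print(ordered)
```

Here `[5, 4]` stays in front because it is the first row of the input, whereas a plain sorted result would start with `[1, 1]`.
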
5

My method turns the 2D array into a 1D complex array, where the real part is the 1st column and the imaginary part is the 2nd column, and then uses np.unique. This only works with 2 columns, though.

import numpy as np 
def unique2d(a):
    x, y = a.T
    b = x + y*1.0j 
    idx = np.unique(b,return_index=True)[1]
    return a[idx] 

Example:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique2d(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
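
One caveat worth checking before relying on this trick with integer data: the columns are converted to float64 inside the complex numbers, so integers beyond 2**53 can collide. A small sketch of that edge case (the array `big` is made up for illustration, and the comparison line assumes NumPy 1.13 or later for `axis=0`):

```python
import numpy as np

def unique2d(a):
    # Same complex-number trick as in the answer above.
    x, y = a.T
    b = x + y * 1.0j
    idx = np.unique(b, return_index=True)[1]
    return a[idx]

# 2**53 and 2**53 + 1 are distinct int64 values, but both become
# 9007199254740992.0 once they are converted to float64.
big = np.array([[2**53, 0], [2**53 + 1, 0]], dtype=np.int64)

rows_complex = len(unique2d(big))           # the two rows collapse into one
rows_axis = len(np.unique(big, axis=0))     # both rows are kept
print(rows_complex, rows_axis)
```

For typical longitude/latitude values stored as floats this should not be an issue.
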
kidnakyo
3

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice and tested interface, plus many related features:

import numpy_indexed as npi
npi.unique(coordskeys)
Eelco Hoogendoorn
3
>>> import numpy as NP
>>> # create a 2D NumPy array with some duplicate rows
>>> A
    array([[1, 1, 1, 5, 7],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8],
           [5, 4, 5, 4, 7],
           [1, 1, 1, 5, 7],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8]])

>>> # first, sort the rows so that duplicates become contiguous
>>> # while each row stays intact
>>> a, b, c, d, e = A.T    # the columns serve as the sort keys for lexsort
>>> ndx = NP.lexsort((a, b, c, d, e))
>>> ndx
    array([1, 3, 5, 7, 0, 4, 2, 6, 8])
>>> A = A[ndx,]

>>> # now diff by row
>>> A1 = NP.diff(A, axis=0)
>>> A1
    array([[ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [-4, -3, -4,  1,  0],
           [ 0,  0,  0,  0,  0],
           [ 6,  8,  3,  2,  1],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0]])

>>> # boolean mask: True where a row differs from the row before it;
>>> # the first row has no predecessor, so it is always kept
>>> ndx = NP.r_[True, NP.any(A1, axis=1)]
>>> ndx
    array([ True, False, False, False,  True, False,  True, False, False])

>>> # retrieve the unique rows:
>>> A[ndx,]
    array([[5, 4, 5, 4, 7],
           [1, 1, 1, 5, 7],
           [7, 9, 4, 7, 8]])
doug
  • Doug, I think you're close, but you're going to run into trouble because NP.sort(A, axis=0) sorts each column independently. Try running your method on the following two arrays: `[[0, 0], [1, 1], [2,2]]` and `[[0, 1], [1, 0], [2,2]]`. I added a sort function to my answer that keeps the rows intact while sorting. – Bi Rico Dec 20 '11 at 01:44
  • I didn't know about lexsort, I'm going to include it in my answer if that's ok – Bi Rico Dec 20 '11 at 03:09
  • @Bago : absolutely--you were first to have solved the heart of problem anyway, which is why i up-voted your answer, and left a comment to let people know that my answer is just a modified version of yours posted several hours later. – doug Dec 20 '11 at 03:43
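
The difference discussed in these comments can be seen on the second array from the comment above. A minimal sketch; the variable names are made up for illustration:

```python
import numpy as np

b = np.array([[0, 1], [1, 0], [2, 2]])

# np.sort along axis 0 sorts every column on its own, so the rows get mixed up:
column_sorted = np.sort(b, axis=0)
print(column_sorted)      # [[0 0], [1 1], [2 2]], rows that were never in b

# np.lexsort returns a permutation of row indices, so each row stays intact:
row_sorted = b[np.lexsort(b.T)]
print(row_sorted)         # [[1 0], [0 1], [2 2]]
```

After the column-wise sort, the result is indistinguishable from `[[0, 0], [1, 1], [2, 2]]`, which is why a diff-based duplicate check on it gives wrong answers.
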
1

Since you refer to numpy.unique, you don't care about maintaining the original order, correct? Converting to a set, which removes duplicates, and then back to a list is a commonly used idiom:

>>> x = [(1, 1), (2, 3), (1, 1), (5, 4), (2, 3)]
>>> y = list(set(x))
>>> y
[(5, 4), (2, 3), (1, 1)]
>>> 
yosukesabai
  • Yes, the order is not important. The solution of combining list + set is the one I used as an example in the OP (which I admit is quite obfuscated). The problem with it is that it uses lists, so the memory usage is huge: the same problem as if I had worked with lists instead of arrays from the beginning. – Sergi Dec 19 '11 at 15:46
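
To put a rough number on the memory concern, here is a sketch that compares the two representations. It assumes CPython, where `sys.getsizeof` reports only each object's own size, so the list figure is an estimate; `n` and the random data are made up for illustration:

```python
import sys
import numpy as np

n = 10_000
coords = np.random.rand(n, 2)                      # float64 array of n pairs
as_tuples = [tuple(row) for row in coords.tolist()]

# The array stores 2 * 8 bytes per pair in one contiguous buffer.
array_bytes = coords.nbytes

# The list holds a pointer per tuple, plus each tuple object and its two floats.
list_bytes = sys.getsizeof(as_tuples) + sum(
    sys.getsizeof(t) + sum(sys.getsizeof(v) for v in t) for t in as_tuples
)

print(array_bytes, list_bytes)
```

The exact ratio depends on the Python build, but the list of tuples should come out several times larger than the array.
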