
I want to get the rows of an array that are unique. Unlike numpy's unique function, I want to exclude all rows that occur more than once.

So the input:

[[1,1],[1,1],[1,2],[2,3],[3,4],[3,4]]

should lead to the output

[[1,2],[2,3]].

I tried to count the occurrences of each row with np.unique(array, return_counts=True) and then filter out the entries with counts > 1. I'm looking both for a more efficient way to do this, and for a way to do it without the returned counts, since return_counts was only introduced in numpy 1.9.

Update: The data size in my case is always [m,2], but once the concept is established, it should be easily transferable to the [m,n] case. In my particular case the dataset consists of integers, but solutions don't have to be limited to that assumption. A typical dataset has m ~ 10^7.
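For reference, the counting approach described above can be sketched like this (a sketch, not necessarily the OP's exact code; it requires numpy >= 1.9 for return_counts, and views each row as one structured element so that np.unique compares whole rows):

```python
import numpy as np

def rows_occurring_once(A):
    # View each row as a single structured element so np.unique
    # compares whole rows (np.unique(..., axis=0) needs numpy >= 1.13)
    A = np.ascontiguousarray(A)
    voided = A.view([('', A.dtype)] * A.shape[1])
    uniq, counts = np.unique(voided, return_counts=True)  # numpy >= 1.9
    # Keep only the rows that appear exactly once
    return uniq[counts == 1].view(A.dtype).reshape(-1, A.shape[1])

A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])
print(rows_occurring_once(A))  # [[1 2]
                               #  [2 3]]
```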

Dschoni
  • What's the datasize of the input array? Are they always integers? – Divakar Nov 18 '15 at 17:31
  • See the answers [here](http://stackoverflow.com/questions/27000092/count-how-many-times-each-row-is-present-in-numpy-array?lq=1) for counting row frequency and then use boolean indexing. – Alex Riley Nov 18 '15 at 17:31
  • I don't think it can be more efficient than that because creating the counts dict would be O(N). You could use `collections.Counter` and that should do the same thing if you don't want to use numpy. – slider Nov 18 '15 at 17:34
  • This is close to being a duplicate of http://stackoverflow.com/q/16970982/1461210, with the exception that you also want to exclude *all* of the rows that occur multiple times, rather than excluding all but one of the duplicates. – ali_m Nov 18 '15 at 18:17
  • Your example shows an array with shape (m, 2), and the values in the array are small integers. Is this typical data? Or might the array be (m, n) with n > 2, or contain integers without an *a priori* bound on the values, or floating point? – Warren Weckesser Nov 18 '15 at 19:41
  • @ali_m: I don't think this is a duplicate because with numpy.unique the answer to the linked question is trivial. I'm explicitly looking for the exclusion. – Dschoni Nov 18 '15 at 21:20
  • @Dschoni True, but once you can get the unique rows with `np.unique` it would be trivial to exclude the duplicated items by returning the item counts using `np.unique(..., return_counts=True)` and filtering the unique row indices based on that. – ali_m Nov 18 '15 at 21:29
  • @ali_m: As mentioned in my original post, I'm looking for a solution that works without `return_counts`, as this is not implemented in numpy 1.8. – Dschoni Nov 19 '15 at 14:22
  • @Dschoni Did you try out the posted solution? – Divakar Nov 19 '15 at 14:47
  • @ajcr: Your linked solution seems to be the fastest; could you add it as an answer, so that I can accept it? – Dschoni Nov 19 '15 at 15:18
  • @Dschoni: the solution I proposed in the linked page is only really good for small arrays (say 1000 rows or fewer). If you have larger arrays, a different method to find the counts is recommended. If you're OK with that I can post an answer with a caveat... – Alex Riley Nov 19 '15 at 15:36
  • @ajcr: That's totally satisfying for my case. I'm nevertheless going to benchmark things with larger arrays. It may even be a strong enough argument to move to a newer numpy version ;) – Dschoni Nov 23 '15 at 12:45

2 Answers


Approach #1

Here's one approach using lex-sorting and np.bincount -

# Perform lex sort and get the sorted array version of the input
sorted_idx = np.lexsort(A.T)
sorted_Ar = A[sorted_idx,:]

# Mask of start of each unique row in sorted array 
mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))

# Get counts of each unique row
unq_count = np.bincount(mask.cumsum()-1) 

# Compare counts to 1 and select the corresponding unique row with the mask
out = sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

Please note that the output would not maintain the order of elements as originally present in the input array.
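Applied to the sample input from the question, the steps above run like this (a self-contained sketch; note the surviving rows come out in lex-sorted order):

```python
import numpy as np

A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])

sorted_idx = np.lexsort(A.T)    # lex-sort rows (last column is the primary key)
sorted_Ar = A[sorted_idx]
# True at the start of each block of identical rows
mask = np.append(True, np.any(np.diff(sorted_Ar, axis=0), 1))
unq_count = np.bincount(mask.cumsum() - 1)  # size of each block
out = sorted_Ar[mask][unq_count == 1]
print(out)  # [[1 2]
            #  [2 3]]
```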

Approach #2

If the elements are integers, you can pack each row of the 2D array A into a single scalar by treating the row as an indexing tuple, and that should be a pretty efficient solution. Also, please note that this approach maintains the order of the rows in the output. The implementation would be -

# Convert 2D array A to a 1D array, treating each row as an indexing tuple
# (per-column bases are max+1 so that distinct rows cannot collide;
# assumes non-negative integers)
A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))

# Get sorting indices for the 1D array
sort_idx = A_1D.argsort()

# Mask of start of each unique row in 1D sorted array 
mask = np.append(True,np.diff(A_1D[sort_idx])!=0)

# Get the counts of each unique 1D element
counts = np.bincount(mask.cumsum()-1)

# Select the IDs with counts==1 and thus the unique rows from A
out = A[sort_idx[np.nonzero(mask)[0][counts==1]]]
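On the sample input from the question, the packing can be traced end to end (a sketch; it uses max + 1 as the per-column base so that distinct rows cannot collide, and assumes non-negative integers):

```python
import numpy as np

A = np.array([[1, 1], [1, 1], [1, 2], [2, 3], [3, 4], [3, 4]])

# Per-column bases: (max + 1) so every distinct row maps to a distinct scalar
scale = np.append((A.max(0) + 1)[::-1].cumprod()[::-1][1:], 1)
A_1D = A.dot(scale)    # here: [ 6  6  7 13 19 19]
sort_idx = A_1D.argsort()
mask = np.append(True, np.diff(A_1D[sort_idx]) != 0)
counts = np.bincount(mask.cumsum() - 1)
out = A[sort_idx[np.nonzero(mask)[0][counts == 1]]]
print(out)  # [[1 2]
            #  [2 3]]
```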

Runtime tests and verification

Functions -

def unq_rows_v1(A):
    sorted_idx = np.lexsort(A.T)
    sorted_Ar =  A[sorted_idx,:]
    mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))
    unq_count = np.bincount(mask.cumsum()-1) 
    return sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

def unq_rows_v2(A):
    A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))  # max+1 bases avoid collisions
    sort_idx = A_1D.argsort()
    mask = np.append(True,np.diff(A_1D[sort_idx])!=0)
    return A[sort_idx[np.nonzero(mask)[0][np.bincount(mask.cumsum()-1)==1]]]

Timings & Verify Outputs -

In [272]: A = np.random.randint(20,30,(10000,5))

In [273]: unq_rows_v1(A).shape
Out[273]: (9051, 5)

In [274]: unq_rows_v2(A).shape
Out[274]: (9051, 5)

In [275]: %timeit unq_rows_v1(A)
100 loops, best of 3: 5.07 ms per loop

In [276]: %timeit unq_rows_v2(A)
1000 loops, best of 3: 1.96 ms per loop
Divakar

The numpy_indexed package (disclaimer: I am its author) can solve this problem efficiently, in a fully vectorized manner. I haven't tested with numpy 1.9 yet, if that is still relevant, but perhaps you'd be willing to give it a spin and let me know. I don't have any reason to believe it won't work with older versions of numpy.

import numpy as np
import numpy_indexed as npi

a = np.random.rand(10000, 3).round(2)
unique, count = npi.count(a)
print(unique[count == 1])

Note that, as per your original question, this solution is not restricted to a specific number of columns or dtype.

Eelco Hoogendoorn