I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:

array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])

I would like to end up with:

array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

That's what I would ultimately like to end up with, but there is an extra requirement: I would also like to store either an array of the indices to discard, or of the indices to keep (à la numpy.take).

I am using NumPy 1.8.1.

codedog
  • You can count how many times each row appears using the methods suggested, for example, [here](http://stackoverflow.com/q/27000092/3923281) and [here](http://stackoverflow.com/q/33786245/3923281). I think that's what your problem here reduces to. – Alex Riley Dec 06 '15 at 21:51
  • @ajcr I can't use `return_counts` so #1 is out for me. Unfortunately #2 seems to require sorted array, and I need to preserve the order. – codedog Dec 06 '15 at 22:35
  • @codedog Were either of the answers helpful? If not, could you let us know what else you're looking for? – ilyas patanam Dec 08 '15 at 06:10

4 Answers

We want to find rows which are not duplicated in your array, while preserving the order.

I use this solution to combine each row of `a` into a single element, so that we can find the unique rows using `np.unique(b, return_index=True, return_inverse=True)`. Then I modified this function to output the counts of the unique rows using the index and inverse. From there, I can select all the unique rows which have `counts == 1`.

import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

# use a flexible data type, np.void, to combine the columns of `a`;
# the size of the np.void is the number of bytes of one element of `a` multiplied by the number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)

def return_counts(index, inv):
    # accumulate one count per unique row, using the inverse mapping
    count = np.zeros(len(index), dtype=int)
    np.add.at(count, inv, 1)
    return count

counts = return_counts(index, inv)

# collect j (the row index into `a`), sorted so the kept rows stay in their original order;
# if you want the indices to discard instead, replace the condition with: counts[i] > 1
index_keep = sorted(j for i, j in enumerate(index) if counts[i] == 1)

>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

# if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

For NumPy >= 1.9, where `np.unique` supports `return_counts`:

b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)

# again, collect the row indices (into `a`) of the rows that occur exactly once
index_keep = sorted(j for i, j in enumerate(index) if counts[i] == 1)

>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
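
One caveat (my addition, not part of the original answer): the `np.void` view used above assumes `a` is C-contiguous; on a sliced or transposed array the view can raise a `ValueError`. A contiguous copy avoids this:

# make sure the rows are laid out contiguously before taking the void view
a = np.ascontiguousarray(a)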
ilyas patanam
  • except the request was to exclude [1,2,3] since it occurs more than once – Dan Patterson Dec 07 '15 at 04:18
  • @Dan Patterson Thanks for pointing it out, I have edited my solution. – ilyas patanam Dec 07 '15 at 17:31
  • Yesterday I found out we have implemented this in a C extension. I have not tested this solution explicitly but it looks very similar to what has been implemented here. That's why I accepted it as a solution. Thanks. – codedog Dec 08 '15 at 22:42

You can proceed as follows (note that `np.unique`'s `axis` keyword requires NumPy >= 1.13 and `return_counts` requires >= 1.9, both newer than the 1.8.1 mentioned in the question):

# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)

# to return the array you want
new_arr = uniq[counts == 1]

# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.in1d(a_idx, uniq_idx[counts==1], invert=True)] 

You get:

# new_arr
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

# nuniq_idx
array([0, 2])
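
One caveat (my addition): `np.unique` returns its rows in sorted order, so `uniq[counts == 1]` only happens to match the original row order here. To preserve the original order in general, index back into `a` with the sorted first-occurrence indices:

# indices (into a) of the rows that occur exactly once, in their original order
order = np.sort(uniq_idx[counts == 1])
new_arr_ordered = a[order]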

s.ouchene

If you want to delete all instances of the elements that exist in duplicate versions, you can iterate through the array, find the indices of the elements that occur more than once, and finally delete those:

import numpy

# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])

# List that will contain the indices of duplicates (which should be deleted)
deleteIndices = []

for i in range(len(array)):            # Loop through the entire array
    indices = list(range(len(array)))  # All indices in the array
    del indices[i]                     # All indices except the i'th row currently being checked

    for j in indices:                  # Loop through every other row
        if (array[i] == array[j]).all():  # Check if the i'th and j'th rows are equal
            deleteIndices.append(j)       # If so, mark the j'th row for deletion

# Sort deleteIndices in ascending order:
deleteIndices.sort()

# Delete duplicates
array = numpy.delete(array, deleteIndices, axis=0)

This outputs:

>>> array
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

>>> deleteIndices
[0, 2]

This way you both delete the duplicates and get a list of the indices to discard. Note that the nested loop makes this approach O(n²) in the number of rows, so it is best suited to small arrays.
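
The question also asks for the indices to keep, à la `numpy.take`. A small sketch of how to derive them from `deleteIndices` (my addition; `original` stands for a copy of the array saved before the `numpy.delete` call above):

# complementary "keep" list, computed against the original (pre-delete) array
keepIndices = [i for i in range(len(original)) if i not in deleteIndices]
kept = numpy.take(original, keepIndices, axis=0)  # same rows as the delete approach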

Johan E. T.

The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:

import numpy as np
import numpy_indexed as npi

index = npi.as_index(arr)    # arr is the input array
keep = index.count == 1      # boolean mask over the unique rows
discard = np.invert(keep)
print(index.unique[keep])
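
For completeness, a quick demo on the question's array (my addition; the output shown assumes `index.unique` returns the unique rows in sorted order):

>>> arr = np.array([[1, 2, 3], [2, 3, 4], [1, 2, 3], [3, 2, 1], [3, 4, 5]])
>>> index = npi.as_index(arr)
>>> index.unique[index.count == 1]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])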
Eelco Hoogendoorn