3

I have a 3D numpy array like this:

>>> a
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],
       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],
       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

I want to remove only those 2D blocks (entries along the first axis) that contain duplicate rows within themselves. For instance, the output should look like this:

>>> remove_row_duplicates(a)
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

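This is the code that I am using:

import numpy as np

# Indices of the 2D blocks that contain duplicate rows
delindices = np.empty(0, dtype=int)

for i in range(len(a)):
    # Round to 10 decimals, then find the unique rows of block i
    _, indices = np.unique(np.around(a[i], decimals=10), axis=0, return_index=True)

    # Fewer unique rows than rows means block i repeats a row
    if len(indices) < len(a[i]):
        delindices = np.append(delindices, i)

a = np.delete(a, delindices, 0)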

This works, but my real array has shape (1000000, 7, 3), and the Python for loop takes a long time at that size. My original array also contains floating-point numbers. Does anyone have a better solution, or can someone help me vectorize this function?

Malik

2 Answers

2

Sort each 2D block along axis=1, compare every row with the row that follows it, and then check along the same axis whether any such pair matches:

b = np.sort(a,axis=1)
out = a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]

Sample run with explanation

Input array :

In [51]: a
Out[51]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[6, 7, 8],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

Code steps :

# Sort along axis=1, i.e. along the rows in each 2D block
In [52]: b = np.sort(a,axis=1)

In [53]: b
Out[53]: 
array([[[0, 1, 2],
        [0, 1, 2],
        [6, 7, 8]],

       [[0, 1, 2],
        [6, 7, 8],
        [6, 7, 8]],

       [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])

In [54]: (b[:,1:] == b[:,:-1]).all(-1) # Look for successive matching rows
Out[54]: 
array([[ True, False],
       [False,  True],
       [False, False]])

# Check for any match along each row of the mask, which indicates the presence
# of duplicate rows within the corresponding 2D block of the original 3D array
In [55]: ((b[:,1:] == b[:,:-1]).all(-1)).any(1)
Out[55]: array([ True,  True, False])

# Invert those as we need to remove those cases
# Finally index with boolean indexing and get the output
In [57]: a[~((b[:,1:] == b[:,:-1]).all(-1)).any(1)]
Out[57]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
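
One thing to watch for: np.sort(a, axis=1) sorts each column independently, so whole rows are not kept together. On the sample above the result happens to coincide with a row-wise sort, but for other data it can hide a real duplicate or create a false one. Below is a minimal sketch of a variant that reorders whole rows with np.lexsort and np.take_along_axis instead. The function name and the `tricky` array are made up for illustration and are not part of the original answer. For floating-point data you would probably want to round first, as the question does with np.around.

```python
import numpy as np

def drop_blocks_with_repeated_rows(a):
    """Keep only the 2D blocks of `a` (shape (n, m, k)) whose rows are all distinct."""
    # np.lexsort treats the LAST key as the primary one, so the columns are
    # passed in reverse order to get a lexicographic ordering of the rows.
    keys = tuple(a[..., c] for c in reversed(range(a.shape[-1])))
    order = np.lexsort(keys, axis=-1)  # shape (n, m): row order inside each block
    # Reorder whole rows so that equal rows end up next to each other.
    b = np.take_along_axis(a, order[..., None], axis=1)
    has_dup = (b[:, 1:] == b[:, :-1]).all(-1).any(1)
    return a[~has_dup]

# Hypothetical block: rows 0 and 2 are equal, but the columns are not jointly ordered.
tricky = np.array([[[0, 5],
                    [1, 3],
                    [0, 5]]])
```

For `tricky`, np.sort(tricky, axis=1) produces the rows [0, 3], [0, 5], [1, 5], so the adjacent-row comparison reports no duplicate, whereas the lexsort-based version flags the block and drops it.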
Divakar
  • Got it. Thanks – Malik Jul 14 '18 at 09:32
  • There is a bit of a problem with your algorithm. It will not work if the two similar rows in one 2D block are not next to each other. My case is a bit more general: they may or may not be next to each other. – Malik Jul 15 '18 at 07:15
  • @Malik We are arranging them next to each other with the sort step at the very beginning. Does that clarify your doubt(s)? Look at : `b = np.sort(a,axis=1)`. – Divakar Jul 15 '18 at 07:18
  • aha, yup I see it now. – Malik Jul 15 '18 at 07:20
1

You can probably do this easily using broadcasting, but since you're dealing with arrays of more than two dimensions, it won't be as optimized as you expect and in some cases may even be very slow. Instead, you can use the following approach, inspired by Jaime's answer:

In [28]: u = np.unique(arr.view(np.dtype((np.void, arr.dtype.itemsize*arr.shape[1])))).view(arr.dtype).reshape(-1, arr.shape[1])

In [29]: inds = np.where((arr == u).all(2).sum(0) == u.shape[1])

In [30]: arr[inds]
Out[30]: 
array([[[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]])
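
For comparison, the broadcasting-style route mentioned at the top of this answer could look roughly like the sketch below: compare every pair of rows inside each block and drop the blocks where any pair matches. This is only an illustration. The function name and the `decimals` parameter are assumptions, with the optional rounding mirroring the np.around call in the question.

```python
import numpy as np

def remove_blocks_with_duplicate_rows(arr, decimals=None):
    """Drop every 2D block of `arr` (shape (n, m, k)) that repeats a row."""
    # Optional rounding, mirroring the np.around step used in the question.
    work = arr if decimals is None else np.around(arr, decimals=decimals)
    # All index pairs (i, j) with i < j among the m rows of a block.
    i, j = np.triu_indices(work.shape[1], k=1)
    # Shape (n, n_pairs): True where row i equals row j within a block.
    pair_equal = (work[:, i] == work[:, j]).all(axis=-1)
    # Keep the blocks in which no pair of rows matches.
    return arr[~pair_equal.any(axis=-1)]
```

With 7 rows per block there are 21 row pairs, so the intermediate arrays are several times larger than the input. For an array of shape (1000000, 7, 3) it may be worth processing the blocks in chunks.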
Mazdak
  • What if `a[1] = a[0]; a[2] = a[0]`? – Divakar Jul 14 '18 at 09:43
  • @Divakar I think it'll depend on whether those equal axes contain unique rows or not. If they do, this code will return both of them, and here we need to know what the OP's expected output is. – Mazdak Jul 14 '18 at 09:53
  • OP's working code would return an empty array, as is also stated in the question: we need to remove duplicates. – Divakar Jul 14 '18 at 09:59
  • @Divakar In that case `u` will be the answer. – Mazdak Jul 14 '18 at 10:00
  • Why would it be `u`? It's just the unique rows globally and not specific to each 2D block, which is what OP wants. – Divakar Jul 14 '18 at 10:02