
In Python, numpy.unique can remove all duplicates from a 1D array very efficiently.

1) How can I remove duplicate rows or columns in a 2D array?

2) And what about nD arrays?
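
For reference, a minimal sketch of the 1D case this question starts from (np.unique is standard NumPy):

import numpy as np

a = np.array([1, 1, 2, 3, 3, 3])
print(np.unique(a))  # [1 2 3] -- duplicates removed, result sorted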

inspectorG4dget
Developer

3 Answers


If possible, I would use pandas.

In [1]: from pandas import *

In [2]: import numpy as np

In [3]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])

In [4]: DataFrame(a).drop_duplicates().values
Out[4]: 
array([[1, 1],
       [2, 3],
       [5, 4]], dtype=int64)
root
  • `pandas` is not installed yet. Can you give some benchmarks? BTW, the input `array` should contain `float`s, not integers. Try with over 10k points. – Developer Dec 30 '12 at 09:45
  • Well, having `pandas` installed now, its performance is outstanding: for 30k 3D points with 10k duplicates (40k total), only 0.2 s. Wow! – Developer Dec 30 '12 at 09:59
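
A minimal sketch of how the commenter's benchmark could be reproduced (the data shape mirrors the comment; the timing harness and `numpy.random.default_rng` are assumptions, and actual numbers will vary by machine):

import time
import numpy as np
from pandas import DataFrame

# ~30k unique 3D float points plus 10k duplicated rows (40k total)
rng = np.random.default_rng(0)
unique_pts = rng.random((30_000, 3))
dupes = unique_pts[rng.integers(0, 30_000, size=10_000)]
a = np.vstack([unique_pts, dupes])

start = time.perf_counter()
deduped = DataFrame(a).drop_duplicates().values
print(deduped.shape, time.perf_counter() - start)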

The following is another approach, which performs much better than a for loop: about 2 s for 10k points plus 100 duplicates.

def tuples(A):
    # Recursively turn nested rows into tuples; scalars are not
    # iterable, so the TypeError branch returns them unchanged.
    try:
        return tuple(tuples(a) for a in A)
    except TypeError:
        return A

# Tuples are hashable, so a set drops the duplicate rows.
b = set(tuples(a))

The idea is inspired by the first part of Waleed Khan's answer, so there is no need for any additional package, and the approach may have further applications. It is also quite Pythonic, I guess.
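
A usage sketch on the same example array as in the pandas answer (converting the set back to an array via `sorted` is an assumption about the desired output; a plain set does not preserve row order):

import numpy as np

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
b = set(tuples(a))                 # {(1, 1), (2, 3), (5, 4)}
unique_rows = np.array(sorted(b))  # back to a 2D array, rows sorted
print(unique_rows)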

Developer

The numpy_indexed package solves this problem for the n-dimensional case (disclaimer: I am its author). In fact, solving this problem was the motivation for starting the package, but it has since grown to include a lot of related functionality.

import numpy as np
import numpy_indexed as npi

a = np.random.randint(0, 2, (3, 3, 3))
print(npi.unique(a))          # unique subarrays along the default (first) axis
print(npi.unique(a, axis=1))  # unique along the second axis
print(npi.unique(a, axis=2))  # unique along the third axis
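
For completeness: since NumPy 1.13, `np.unique` itself accepts an `axis` argument, which covers the duplicate-rows/columns case natively; a minimal sketch:

import numpy as np

a = np.random.randint(0, 2, (3, 3, 3))
print(np.unique(a, axis=0))  # unique subarrays along the first axis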
Eelco Hoogendoorn