257

I need to find unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

I know that I can create a set and loop over the array, but I am looking for an efficient pure NumPy solution. I believe there is a way to set the data type to void and then just use numpy.unique, but I couldn't figure out how to make it work.

Saullo G. P. Castro
Akavall

  • pandas has a dataframe.drop_duplicates() method. See http://stackoverflow.com/questions/12322779/pandas-unique-dataframe and http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html – codeape Jun 06 '13 at 19:55
  • possible duplicate of [Removing duplicates in each row of a numpy array](http://stackoverflow.com/questions/7438438/removing-duplicates-in-each-row-of-a-numpy-array) – Andy Hayden Jun 06 '13 at 20:00
  • How about http://stackoverflow.com/a/8567929/3571 ? – codeape Jun 06 '13 at 20:08
  • @Andy Hayden, despite the title, it is not a duplicate of this question. codeape's link is a duplicate though. – Wai Yip Tung Nov 23 '13 at 06:21
  • This feature is coming natively in 1.13: https://github.com/numpy/numpy/pull/7742 – Eric Nov 18 '16 at 10:40

20 Answers

196

As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, use np.unique as follows:

unique_rows = np.unique(original_array, axis=0)
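
For example, with the array a from the question (note that np.unique returns the rows sorted lexicographically, not in their original order):

>>> import numpy as np
>>> a = np.array([[1, 1, 1, 0, 0, 0],
...               [0, 1, 1, 1, 0, 0],
...               [0, 1, 1, 1, 0, 0],
...               [1, 1, 1, 0, 0, 0],
...               [1, 1, 1, 1, 1, 0]])
>>> np.unique(a, axis=0)
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])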
Mateen Ulhaq
aiwabdn

  • Careful with this function. `np.unique(list_cor, axis=0)` gets you the _array with duplicate rows removed_; it does not filter the array to elements that _are unique in the original array_. See [here](https://stackoverflow.com/q/47562201/7954504), for example. – Brad Solomon Nov 29 '17 at 22:08
  • Note that if you want unique rows ignoring the order of values within each row, you can sort the original array along the columns first: `original_array.sort(axis=1)` – mangecoeur Mar 02 '20 at 11:59
  • I wish there was an equivalent of Pandas' `drop_duplicates()`: it doesn't sort (it uses an efficient hashing algorithm instead). Sorting is often unwanted, and incurs extra computation. – Pierre D Feb 17 '23 at 03:07
147

Yet another possible solution:

np.vstack({tuple(row) for row in a})

Edit: As others have mentioned, this approach is deprecated as of NumPy 1.16. In modern versions you can instead do:

np.vstack(tuple(set(map(tuple,a))))

Here map(tuple, a) makes every row of the matrix hashable by turning it into a tuple, and set(map(tuple, a)) builds a set of these unique rows. Sets are non-sequence iterables and as such can no longer be used directly to construct NumPy arrays; the outer call to tuple fixes this by converting the set to a tuple, making it acceptable for creating an array.
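
As an aside, if you need the rows in their original order of first appearance (the set-based one-liner does not guarantee any order), a minimal sketch assuming Python 3.7+ (where plain dicts preserve insertion order) would be:

import numpy as np

# dict.fromkeys keeps the first occurrence of each row, in order
unique_rows = np.array(list(dict.fromkeys(map(tuple, a))))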

Greg von Winckel

  • +1 This is clear, short and pythonic. Unless speed is a real issue, these types of solutions should take preference over the complex, higher voted answers to this question IMO. – Bill Cheatham Apr 30 '14 at 13:36
  • Excellent! Curly braces or the set() function does the trick. – Tian He May 04 '16 at 15:51
  • @Greg von Winckel Can you suggest something which doesn't change order? – Laschet Jain Feb 12 '17 at 22:30
  • Yes, but not in a single command: x=[]; [x.append(tuple(r)) for r in a if tuple(r) not in x]; a_unique = array(x); – Greg von Winckel May 12 '17 at 15:18
  • To avoid a FutureWarning, convert the set to a list, like: `np.vstack(list({tuple(row) for row in AIPbiased[i, :, :]}))` FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future. – leermeester Dec 10 '19 at 08:21
115

An alternative to structured arrays is to use a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

EDIT Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT The above can be slightly sped up, perhaps at the cost of clarity, by doing:

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

Also, at least on my system, it is on par with, or even faster than, the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
Jaime

  • Thanks a lot. This is the answer that I was looking for, can you explain what is going on in this step: `b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))`? – Akavall Jun 07 '13 at 00:28
  • @Akavall It is creating a view of your data with a `np.void` data type whose size is the number of bytes in a full row. It's similar to what you get if you have an array of `np.uint8`s and view it as `np.uint16`s, which combines every two columns into a single one, but more flexible. – Jaime Jun 07 '13 at 02:34
  • This is slower, less flexible, and more memory intensive (`np.unique` creates a copy, as does `a[idx]`) than a lexsort-based method. – cge Jun 07 '13 at 07:32
  • @Jaime, can you add a `np.ascontiguousarray` or similar to be generally safe (I know it is a bit more restrictive than necessary, but...). The rows *must* be contiguous for the view to work as expected. – seberg Jun 07 '13 at 10:04
  • Interesting: I had assumed this would be equivalent to the structured array method, but it's actually significantly better, and does seem to do better than the lexsort method as well, especially for larger shape[1]. – cge Jun 07 '13 at 18:53
  • @ConstantineEvans It is a recent addition: in numpy 1.6, trying to run `np.unique` on an array of `np.void` returns an error related to mergesort not being implemented for that type. It works fine in 1.7 though. – Jaime Jun 07 '13 at 20:01
  • Jaime and @ConstantineEvans. Do you know how to get column membership from any of the solutions that you presented? I started a new question based on these answers here: http://stackoverflow.com/questions/18197071/find-unique-columns-and-column-membership – Amelio Vazquez-Reina Aug 12 '13 at 21:55
  • It's worth noting that if this method is used for floating point numbers there's a catch that `-0.` will not compare as equal to `+0.`, whereas an element-by-element comparison would have `-0. == +0.` (as specified by the IEEE float standard). See http://stackoverflow.com/questions/26782038/how-to-eliminate-the-extra-minus-sign-when-rounding-negative-numbers-towards-zer – tom10 Nov 06 '14 at 23:52
  • @tom10 Yes, absolutely true. Then again, finding unique values in a floating point array is risky business anyway... – Jaime Nov 07 '14 at 00:34
  • @ali_m My feeling is that the concerns were mostly about the interface, see [here](http://mail.scipy.org/pipermail/numpy-discussion/2013-August/067443.html). There wasn't much enthusiasm on the list, because there rarely is much enthusiasm about anything that doesn't involve bashing `np.matrix`... ;-) I think Joe got discouraged a little too early, and wouldn't be surprised if something similar to that PR eventually made it into numpy. – Jaime Apr 21 '15 at 00:51
  • In my case the first option (`...; unique_a = a[idx]`) performs actually faster: 4.8s vs 6.4s for shape (1.5e7, 2), consistent in two repeats – Dima Lituiev Nov 18 '15 at 02:03
  • @DimaLituiev That's a surprising result... Calling `unique` with `return_index=True` should always be more expensive than doing it without, as it has to do extra work to compute the index array. On top of that, you then have to use that index array to select some values from the original array. On the other side, the `.view` and `.reshape` operations should be virtually free, as they only affect array metadata. – Jaime Nov 18 '15 at 06:08
  • Unfortunately it fails on `dtype=bool`. – Dima Lituiev Nov 18 '15 at 07:01
  • This is a great answer, but now that this functionality is built in (in numpy 1.13), I have accepted a built-in solution answer, since at this point using the built-in would be the best approach to take. – Akavall Jul 10 '17 at 16:15
  • Is there any benefit to this answer over the `axis` argument of `np.unique()`, now that that exists? – endolith Aug 07 '19 at 15:03
31

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example:

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)

To understand what's going on, have a look at the intermediary results.

Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we'll get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

We then need to view that as a "normal" array (`_` stores the result of the last calculation in IPython, which is why you're seeing `_.view`...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you wanted to be more concise, you could write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print(unique_rows(data))

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]
Joe Kington
  • This actually seems very slow, almost as slow as using tuples. Sorting a structured array like this is slow, apparently. – cge Jun 06 '13 at 20:28
  • @cge - Try it with larger-sized arrays. Yes, sorting a numpy array is slower than sorting a list. Speed isn't the main consideration in most cases where you're using ndarrays, though. It's memory usage. A list of tuples will use _vastly_ more memory than this solution. Even if you have enough memory, with a reasonably large array, converting it to a list of tuples has greater overhead than the speed advantage. – Joe Kington Jun 06 '13 at 20:34
  • @cge - Ah, I didn't notice you were using `lexsort`. I thought you were referring to using a list of tuples. Yeah, `lexsort` is probably the better option in this case. I'd forgotten about it, and jumped to an overly complex solution. – Joe Kington Jun 06 '13 at 20:37
20

When I run np.unique on np.random.random(100).reshape(10,10), it returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)

That is the only way I see to change the types to do what you want, and I am not sure if the list iteration to convert to tuples is okay with your "not looping through".

Ryan Saxe

  • I prefer this over the accepted solution. Speed isn't an issue for me because I only have perhaps `< 100` rows per invocation. This precisely describes how performing unique over rows is performed. – rayryeng Apr 01 '15 at 17:04
  • This actually does not work for my data, `uniques` contains unique elements. Potentially I misunderstand the expected shape of `array` - could you be more precise here? – FooBar Apr 20 '15 at 13:34
  • @ryan-saxe I like that this is pythonic but this is not a good solution because the row returned to `uniques` are sorted (and therefore different from the rows in `array`). `B = np.array([[1,2],[2,1]]); A = np.unique([tuple(row) for row in B]); print(A) = array([[1, 2],[1, 2]])` – jmlarson Mar 23 '16 at 12:20
19

np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:

ind = np.lexsort(a.T)  # indices that sort the rows lexicographically (last column is the primary sort key)
a[ind[np.concatenate(([True], np.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]]  # keep each row that differs from its predecessor

This method does not use tuples, and should be much faster and simpler than other methods given here.

NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.
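
Wrapped up as a small helper, a sketch of the same approach (the function name is mine, not from the original answer):

import numpy as np

def unique_rows_lexsort(a):
    # sort rows lexicographically, then keep each row that differs from its predecessor
    b = a[np.lexsort(a.T)]
    mask = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[mask]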

Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:

b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1], axis=tuple(range(1, a.ndim)))))]

An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.

Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop
j-i-l
cge

  • Very nice! On a side note, though, it does make several intermediary copies. (e.g. `a[ind[1:]]` is a copy, etc) On the other hand, your solution is generally 2-3x faster than mine up until you run out of ram. – Joe Kington Jun 06 '13 at 20:55
  • Good point. As it turns out, my attempt to take out intermediary copies by using just the indexes made my method use more memory and end up slower than just making a sorted copy of the array, as a_sorted[1:] isn't a copy of a_sorted. – cge Jun 06 '13 at 21:16
  • What is `dtype` in your timings? I think you got that one wrong. On my system, calling `np.unique` as described in my answer is slightly faster than using either of your two flavors of `np.lexsort`. And it is about 5x faster if the array to find uniques has shape `(10000, 100)`. Even if you decide to reimplement what `np.unique` does to trim some (minor) execution time, collapsing every row into a single object runs faster comparisons than having to call `np.any` on the comparison of the columns, especially for higher column counts. – Jaime Jun 07 '13 at 09:55
  • @cge: you probably meant 'np.any' instead of standard 'any' which does not take keyword arguments. – M. Toya Sep 12 '13 at 10:59
  • @Jaime - I believe `dtype` is just `a.dtype`, i.e. the data type of the data being viewed, as was done by Joe Kington in his answer. If there are many columns, another (imperfect!) way to keep things fast using `lexsort` is to only sort on a few columns. This is data-specific as one needs to know which columns provide enough variance to sort perfectly. E.g. `a.shape = (60000, 500)` - sort on the first 3 columns: `ind = np.lexsort((a[:, 2], a[:, 1], a[:, 0]))`. The time savings are fairly substantial, but the disclaimer again: it might not catch all cases - it depends on the data. – n1k31t4 Mar 21 '18 at 13:31
12

I've compared the suggested alternatives for speed and found that, surprisingly, the void-view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want

numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])

I've implemented that fastest variant in npx.unique_rows.
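
Usage would then be simply (a sketch, assuming the npx package is installed):

import npx
unique_rows = npx.unique_rows(a)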

There is a bug report on GitHub for this, too.

(Performance plot: runtime versus len(a) for unique_void_view, lexsort, vstack, and unique_axis; see the code below.)


Code to reproduce the plot:

import numpy
import perfplot


def unique_void_view(a):
    return (
        numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
        .view(a.dtype)
        .reshape(-1, a.shape[1])
    )


def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[
        ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
    ]


def vstack(a):
    return numpy.vstack([tuple(row) for row in a])


def unique_axis(a):
    return numpy.unique(a, axis=0)


perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
    equality_check=None,
)
Nico Schlömer

  • Very nice answer, one minor point: `vstack_dict` never uses a dict; curly braces is a set comprehension, and therefore its behavior is nearly identical to `vstack_set`. Since the `vstack_dict` performance line is missing from the graph, it looks like it is just being covered by the `vstack_set` performance graph, since they are so similar! – Akavall Jul 09 '17 at 16:56
  • Thanks for the reply. I've improved the plot to include only one `vstack` variant. – Nico Schlömer Jul 10 '17 at 07:46
9

Here is another variation on @Greg's pythonic answer:

np.vstack(set(map(tuple, a)))
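
Note that, as mentioned in the comments above, NumPy 1.16+ wants a sequence rather than a bare set when stacking, so a hedged modern variant would be:

np.vstack(list(set(map(tuple, a))))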
divenex
8

I didn't like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being "equal" means "within some ε of each other". The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn't as mathematically general as a true vector distance.

Here’s my version:

import numpy as np
from scipy.spatial.distance import squareform, pdist

def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]

# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)

The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.

As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).

If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine-precision, i.e., thresh=np.spacing(1)*1e3.
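
For instance, a quick usage sketch (the array values are mine, chosen so that the first two rows fall within the tolerance; assumes the definitions above):

arr = np.array([[1.0, 2.0],
                [1.0, 2.0 + 1e-10],  # within 1e-9 of the first row
                [3.0, 4.0]])
print(uniqueRows(arr, thresh=1e-9))  # the first two rows collapse into one (row order may vary)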

Ahmed Fasih
  • Best answer. Thanks. It is the most (mathematically) generalized answer written so far. It considers a matrix as a set of data points or samples in the N-dimensional space and find a collection of same or similar points (similarity being defined by either Euclidean distance or by any other methods). These points can be overlapping data points or very close neighborhoods. At the end, a collection of same or similar points are replaced by any of the point (in the above answer by a first point) belonging to the same set. This helps to reduce redundancy from a point cloud. – Sanchit Aug 02 '16 at 10:01
  • @Sanchit aha, that’s a good point, instead of picking the “first” point (actually it could be effectively random, since it depends on how Python stores the points in a `set`) as representative of each `thresh`-sized neighborhood, the function could allow the user to specify how to pick that point, e.g., use the “median” or the point closest to the centroid, etc. – Ahmed Fasih Aug 02 '16 at 14:35
  • Sure. No doubt. I just mentioned the first point since this is what your program is doing which is completely fine. – Sanchit Aug 02 '16 at 15:17
  • Just a correction—I wrongly said above that the row that would be picked for each `thresh`-cluster would be random because of the unordered nature of `set`. Of course that’s a brainfart on my part, the `set` stores tuples of indexes that are in the `thresh`-neighborhood, so this `findRows` *does* in fact return, for each `thresh`-cluster, the first row in it. – Ahmed Fasih Aug 02 '16 at 16:50
4

Why not use drop_duplicates from pandas?

>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop

>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop
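
Applied to the question's array, a sketch (drop_duplicates keeps the first occurrence of each row, so unlike np.unique the original order is preserved; .values returns a plain ndarray):

import numpy as np
import pandas as pd

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

unique_rows = pd.DataFrame(a).drop_duplicates().values
# array([[1, 1, 1, 0, 0, 0],
#        [0, 1, 1, 1, 0, 0],
#        [1, 1, 1, 1, 1, 0]])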
kalu
  • I actually love this answer. Sure, it doesn't use numpy directly, but to me it's the one that's easiest to understand while being fast. – noctilux May 12 '17 at 02:58
3

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:

import numpy_indexed as npi
new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default
Eelco Hoogendoorn
1

Based on the answers on this page I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature of accepting a tolerance for the uniqueness check. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.

import numpy as np

def unique_rows(data, prec=5):
    # round to `prec` decimals; adding 0.0 normalizes -0.0 to +0.0
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    # view each row as a single void item so np.unique can compare whole rows
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)    # c = data[ia, :]
    _, ic = np.unique(b, return_inverse=True)  # data = c[ic, :]
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
Arash_D_B
  • 501
  • 1
  • 5
  • 11
1

Beyond @Jaime's excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void, n)). We finally arrive at this shortest version:

a[unique(a.view(void(a.strides[0])),1)[1]]

which gives, for the example a:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]
B. M.
0

np.unique works given a list of tuples:

>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]: 
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])

With a list of lists it raises a TypeError: unhashable type: 'list'

codeape
0

For general purposes, like 3D or higher multidimensional nested arrays, try this:

import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])

which satisfies your 2D dataset:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)

gives:

array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

But also 3D arrays like:

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)

gives:

array([[[0, 1, 1], [1, 1, 1]],
       [[1, 1, 1], [0, 1, 1]],
       [[1, 1, 1], [1, 1, 1]]])
Tara
  • Using the `unique` `return_index` as Jaime does should make that last `return` line simpler. Just index the original `ar` on the right axis. – hpaulj Aug 22 '16 at 22:24
0

None of these answers worked for me. I'm assuming it's because my unique rows contained strings rather than numbers. However, this answer from another thread did work:

Source: https://stackoverflow.com/a/38461043/5402386

You can use the .count() and .index() list methods:

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
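
For the example above, this gives (a quick check of the values):

# unique_coor  -> [(10, 10), (12, 9), (10, 5)]
# unique_count -> [1, 2, 1]
# unique_index -> [0, 1, 2]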
mjp
0

We can actually turn an m x n numeric numpy array into an m x 1 numpy string array. Try the following function; it provides counts, inverse indices, etc., just like numpy.unique:

import numpy as np

def uniqueRow(a):
    # This function turns an m x n numpy array into an m x 1 numpy array
    # of strings, so that np.unique can be used.

    # Input: an m x n numpy array (a)
    # Output: unique m' x n numpy array (unique), inverse_idx, and counts

    s = np.chararray((a.shape[0],1))
    s[:] = '-'

    b = a.astype(str)  # np.str was removed in NumPy 1.24; plain str works

    s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)

    n = a.shape[1] - 2    

    for i in range(0,n):
         s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)

    s3, idx, inv_, c = np.unique(s2,return_index = True,  return_inverse = True, return_counts = True)

    return a[idx], inv_, c

Example:

A = np.array([[3.17,  9.502, 3.291],
              [9.984, 2.773, 6.852],
              [1.172, 8.885, 4.258],
              [9.73,  7.518, 3.227],
              [8.113, 9.563, 9.117],
              [9.984, 2.773, 6.852],
              [9.73,  7.518, 3.227]])

B, inv_, c = uniqueRow(A)

Results:

B:
[[ 1.172  8.885  4.258]
 [ 3.17   9.502  3.291]
 [ 8.113  9.563  9.117]
 [ 9.73   7.518  3.227]
 [ 9.984  2.773  6.852]]

inv_:
[3 4 1 0 2 4 0]

c:
[2 1 1 1 2]
Ting On Chan
-1

Let's get the entire numpy matrix as a list, then drop the duplicates from this list, and finally return our unique list back into a numpy matrix:

matrix_as_list = data.tolist()
# matrix_as_list:
# [[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]

uniq_list = list()
uniq_list.append(matrix_as_list[0])

for item in matrix_as_list:
    if item not in uniq_list:
        uniq_list.append(item)

unique_matrix = np.array(uniq_list)
# unique_matrix:
# array([[1, 1, 1, 0, 0, 0],
#        [0, 1, 1, 1, 0, 0],
#        [1, 1, 1, 1, 1, 0]])
Mahdi Ghelichi
-3

The most straightforward solution is to make the rows a single item by turning them into strings. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable; you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.

import numpy as np

original = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)    

Will Give:

 array([[0, 1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0]])

Send my nobel prize in the mail

Dave Pena
  • Very inefficient and error prone, e.g. with different print options. The other options are clearly preferable. – Michael Nov 28 '16 at 18:18
-3
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
# create a view that treats each row as a single item (a tuple of fields) and return the unique indices
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# get unique set
print(original[unique_index])
YoungLearnsToCoding