
What is a faster way of finding unique x,y points (removing duplicates) in a numpy array like:

points = numpy.random.randint(0, 5, (10,2))

I thought of converting the points to complex numbers and then checking for unique values, but that seems rather convoluted:

b = numpy.unique(points[:,0] + 1j * points[:,1])
points = numpy.column_stack((b.real, b.imag))
Benjamin
  • If you don't require order to be preserved, use tuples for the points and convert the list to a set. – wim Nov 03 '11 at 03:41
  • I need the result to be a numpy array, so that seems like a lot of conversions. – Benjamin Nov 03 '11 at 03:44
  • Is there a real reason why the simple solution `numpy.vstack([numpy.array(u) for u in set([tuple(p) for p in points])])` is not fast enough? – wim Nov 03 '11 at 03:50
  • Thinking there must be a faster way than list comprehension when it gets to longer lists of points, no? – Benjamin Nov 03 '11 at 03:56
  • Wim's method is faster, especially for larger arrays. Probably because it doesn't bother sorting the result. I added some timeit benchmarks to my post. Perhaps wim will post his solution as an answer? – unutbu Nov 03 '11 at 04:25

2 Answers


I would do it like this:

numpy.array(list(set(tuple(p) for p in points)))

For a fast solution in the most general case, this recipe might interest you: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
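As an aside (an addition, not part of the original answer): NumPy 1.13 and later can deduplicate rows directly via the `axis` keyword of `np.unique`, which also returns the rows lexicographically sorted:

```python
import numpy as np

np.random.seed(1)
points = np.random.randint(0, 5, (10, 2))

# Treat each row as one unit and drop duplicate rows;
# the result comes back sorted by x, then y.
unique_points = np.unique(points, axis=0)
print(unique_points)
print(len(unique_points))
```

With this seed there are 7 unique rows, matching the output shown in the other answer.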

wim
  • I have a similar problem for which this works a charm but it comes out unsorted, even though the data went in sorted. Why does this happen? – Warrick Sep 19 '12 at 13:47
  • Because the ordering is destroyed by `set`, which is an unordered collection. – wim Sep 19 '12 at 14:21
  • There is nothing stopping you from sorting the output, though :) The set is only used as an intermediate step to remove dupes here. – wim Sep 19 '12 at 14:23

I think you have a very good idea here. Think about the underlying block of memory used to represent the data in points. We tell numpy to regard that block as representing an array of shape (10,2) with dtype int32 (32-bit integers), but it is almost costless to tell numpy to regard that same block of memory as representing an array of shape (10,) with dtype c8 (64-bit complex).

So the only real cost is calling np.unique, followed by another virtually costless call to view and reshape:

import numpy as np
np.random.seed(1)
points = np.random.randint(0, 5, (10,2))
print(points)
print(len(points))

yields

[[3 4]
 [0 1]
 [3 0]
 [0 1]
 [4 4]
 [1 2]
 [4 2]
 [4 3]
 [4 2]
 [4 2]]
10

while

cpoints = points.view('c8')
cpoints = np.unique(cpoints)
points = cpoints.view('i4').reshape((-1,2))
print(points)
print(len(points))

yields

[[0 1]
 [1 2]
 [3 0]
 [3 4]
 [4 2]
 [4 3]
 [4 4]]
7
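One caveat worth noting (my addition, not part of the original answer): the `'c8'` view assumes `points` holds 32-bit integers. On platforms where `np.random.randint` returns int64, each element is already 8 bytes, so the `'c8'` view leaves the shape at (10,2) and produces garbage, which may explain the failures reported in the comments. A dtype-robust sketch pins the dtype first:

```python
import numpy as np

np.random.seed(1)
points = np.random.randint(0, 5, (10, 2)).astype(np.int32)  # pin the dtype explicitly

# Two int32 values per row reinterpreted as one complex64 (8 bytes total).
cpoints = points.view(np.complex64)
unique_points = np.unique(cpoints).view(np.int32).reshape(-1, 2)
print(unique_points)
print(len(unique_points))
```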

If you don't need the result to be sorted, wim's method is faster (you might want to consider accepting his answer...)

import numpy as np
np.random.seed(1)
N=10000
points = np.random.randint(0, 5, (N,2))

def using_unique():
    cpoints = points.view('c8')
    cpoints = np.unique(cpoints)
    return cpoints.view('i4').reshape((-1,2))

def using_set():
    return np.vstack([np.array(u) for u in set([tuple(p) for p in points])])

yields these benchmarks:

% python -mtimeit -s'import test' 'test.using_set()'
100 loops, best of 3: 18.3 msec per loop
% python -mtimeit -s'import test' 'test.using_unique()'
10 loops, best of 3: 40.6 msec per loop
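If sorted output is needed from the set-based approach, a plain Python sort on the tuples restores the same ordering (a sketch I am adding here, not part of the original benchmark):

```python
import numpy as np

np.random.seed(1)
points = np.random.randint(0, 5, (10, 2))

# Dedupe via a set of tuples, then sort lexicographically (by x, then y),
# matching the order that np.unique produces.
unique_sorted = np.array(sorted(set(map(tuple, points))))
print(unique_sorted)
```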
unutbu
  • np.unique sorts the result. Are you looking for a method which keeps the remaining elements in order? – unutbu Nov 03 '11 at 03:37
  • No, I mean, I obtain the wrong result: cpoints.shape is still 10,2 and the final points do not match the original. – Benjamin Nov 03 '11 at 03:41
  • I've edited the post to show it does work for at least one seed. Can you show an example where it does not work (with a seed so the problem is reproducible)? – unutbu Nov 03 '11 at 03:45
  • Not what I'm seeing... Python 2.7.1, numpy 2.0.0. I'm getting the same result with Python 2.6.7 and numpy 1.5.1 and with Python 2.7.1 and numpy 1.5.1 (on mac), which is [[0,0],[1,0],[2,0],[3,0],[4,0]]... Hmm. – Benjamin Nov 03 '11 at 03:53
  • But runs fine on my old Python 2.5.6 with numpy 1.2.1. Any idea what might be causing this? Thanks for a nice answer, btw. – Benjamin Nov 03 '11 at 04:02
  • I'm using Python 2.7.2, and numpy 1.5.1. I checked the documentation for [np.unique on numpy 2.0.0](http://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html#numpy-unique) versus [np.unique on numpy 1.5.1](http://docs.scipy.org/doc/numpy-1.5.x/reference/generated/numpy.unique.html). They appear the same, so I have no idea why they should behave differently. It would help immensely if you could provide a simple example of an array on which the method does not work (either be specifying `np.random.seed(...)` or a concrete `points` array.) – unutbu Nov 03 '11 at 04:13
  • Not that anyone will C, but recarrays might be nicer than just a larger dtype (and more general) – seberg Sep 19 '12 at 15:00