Generating numpy array of indices for a deduplicated set of points

Question

I have an array of a minimum of 10s of thousands of points (up to 3 billion) some of which are duplicated. I'd like to deduplicate the points and generate an index array which retains the original sequence of the duplicated points.

For example:

x = [(0, 0),  # (x1, y1)
     (1, 0),  # (x2, y2)
     (1, 1),  # (x3, y3)
     (0, 0)]  # (x4, y4)

Deduplicating x, we have y:

y = list(set(x)) = [(1, 0),  # (x2, y2)
                    (0, 0),  # (x1, y1) and (x4, y4)
                    (1, 1)]  # (x3, y3)

And then we would have a resulting index array, z:

z = [1,  # (x1, y1) 
     0,  # (x2, y2)
     2,  # (x3, y3)
     1]  # (x4, y4)

Is there a numpy-like way of obtaining z? Here's a brute-force implementation:

z = []
for each_point in x:
    index = y.index(each_point)
    z.append(index)

Don't want to be pedantic but I don't see any `numpy` in here. More important: I don't think you can create a `numpy` array containing tuples as elements. Finally: `numpy.where` is the function you could be using on a numpy `array` but it's expensive if you need to locate (many) elements. — emvee, Mar 09 '15 at 18:24
I think using numpy in the question would be obfuscating the actual question, which is really looking for parity with that brute force implementation. I looked at where before posting, but I didn't see an example that matched up with what I wanted to do. — Brian Bruggeman, Mar 09 '15 at 18:27
I think [this](http://stackoverflow.com/a/18197790/1461210) is almost exactly what you're looking for — ali_m, Mar 09 '15 at 18:29
Actually, I had the same (type of) problem some time ago and I also didn't find a good solution. The python list + set + dict 'brute force' approach was (a lot) faster than using numpy; indexing into a numpy array is horribly slow. — emvee, Mar 09 '15 at 18:30

score 2 · Accepted Answer · edited May 23 '17 at 12:24

2

x2 = np.ascontiguousarray(x).view(np.dtype((np.void, x.dtype.itemsize * x.shape[1])))
y_temp, z = np.unique(x2, return_inverse=True)
y = y_temp.view(dtype='int64').reshape(len(y_temp), 2)
print(y)
print(z)

yields

[[0 0]
 [1 0]
 [1 1]]

and

[0 1 2 0]

Credit: Find unique rows in numpy.array

edited May 23 '17 at 12:24

Community

1
1

answered Mar 09 '15 at 18:40

Alex

18,484
8
60
80

Thanks Alex. I think this provides 'z', but I need both y and z. If I understand what happened here, I would then need to construct 'y' from a sorted 'z'. y = np.array(sorted(tuple(x[idx]) for idx in np.unique(z))) – Brian Bruggeman Mar 09 '15 at 19:44
Not at all, I just ignored the unpacked values (and only kept the indices). See updated answer. – Alex Mar 09 '15 at 19:55

score 2 · Answer 2 · answered Apr 02 '16 at 19:56

This problem can be elegantly solved using the numpy_indexed package (disclaimer: I am its author). It is similar to the solution posted by Alex under the hood; but with a nicer interface and more tests:

import numpy_indexed as npi
y, z = npi.unique(x, return_inverse=True)

Generating numpy array of indices for a deduplicated set of points

2 Answers2