
I have two lists from which I need to find the indices associated with unique pairs (all the SO posts I could find are only interested in the pairs themselves). I've been trying to use numpy.unique to do so, but am hitting an oddity. I zipped the lists to create a list of tuples, which then set() and np.unique() successfully pare down to only the unique pairs, but what I want is the indices into the original list. The documentation for unique indicates that it will return those if return_inverse=True. However, I am getting different levels of "flattening" if that is set or not.

In this example I use strings just to avoid any floating-point comparison issues; in reality the values are floats.

import numpy as np

l_1 = ['12.34', '12.34', '12.34', '12.34', '56.78', '56.78', '90.12', '90.12']
l_2 = ['-1.23', '-1.23', '-4.56', '-4.56', '-6.78', '-6.78', '-9.01', '-9.01']
ll = list(zip(l_1, l_2))  # list() so this also works on Python 3, where zip() is lazy

ull1 = np.unique(ll)

ull2, inds = np.unique(ll, return_inverse=True)

In the first case the pairs are preserved as a second dimension in the output. In the second case even the tuples are flattened out, thus destroying the pairs.

In [1]: ull1
Out[1]: 
array([['-9.01', '90.12'],
       ['-1.23', '12.34'],
       ['-6.78', '56.78'],
       ['-4.56', '12.34']], 
      dtype='|S5')

In [2]: ull2
Out[2]:
array(['-1.23', '-4.56', '-6.78', '-9.01', '12.34', '56.78', '90.12'], 
      dtype='|S5')

Is this done on purpose? Is there some way to make unique give me the indices that I want in the first case (which would be something like [[6,7], [0,1], [4,5], [2,3]])? I can't tell from the documentation if the former or latter behavior is the odd one out.

I need the indices to operate on other values from similar lists. If I had access to pandas I would use it, but the computer I have to run on only has a very old version of numpy and no pandas. However, this same thing still happens in numpy 1.8.1. I know that I could do something like the following:

sll = list(set(ll))
for i in range(len(sll)):
    inds = np.where([val == sll[i] for val in ll])
    # I do my operations here using inds

but I'm hoping there may be something more elegant?

Ajean
  • @moarningsun Ahah, I had found that question but I didn't see the `idx` in that one answer until you specifically called it out. I think I got befuddled by the length and number of the answers there... – Ajean Sep 05 '14 at 19:46
  • Right, it would've been better if I linked to the specific answer: http://stackoverflow.com/a/16973510/2379410 –  Sep 05 '14 at 19:51

1 Answer


The source code for numpy.unique in version 1.8.1 starts with the following:

try:
    ar = ar.flatten()
except AttributeError:
    if not return_inverse and not return_index:
        return np.sort(list(set(ar)))
    else:
        ar = np.asanyarray(ar).flatten()

If the input isn't an array (so it has no flatten method) and neither return_inverse nor return_index is set, the routine delegates to Python built-ins to find the unique elements. That shortcut is buggy: it does not perform the flattening that the documentation guarantees:

Input array. This will be flattened if it is not already 1-D.

As Jaime points out in the comments, this has been fixed in the current NumPy master branch.
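To see the two code paths side by side, here is a minimal sketch that re-creates the quoted shortcut by hand rather than calling into NumPy 1.8.1 itself. The variable names `shortcut` and `flattened` are mine, and the `list(zip(...))` call is only there so the snippet also runs on Python 3.

```python
import numpy as np

l_1 = ['12.34', '12.34', '12.34', '12.34', '56.78', '56.78', '90.12', '90.12']
l_2 = ['-1.23', '-1.23', '-4.56', '-4.56', '-6.78', '-6.78', '-9.01', '-9.01']
ll = list(zip(l_1, l_2))

# Hand-written copy of the shortcut branch: no flattening happens, so the
# tuples survive set() and np.sort() receives a 2-D array of pairs.
shortcut = np.sort(list(set(ll)))

# The documented behaviour: convert to an array and flatten before uniquing.
flattened = np.unique(np.asanyarray(ll).flatten())

print(shortcut.shape)   # (4, 2)
print(flattened.shape)  # (7,)
```

Note that np.sort works along the last axis, so in the shortcut branch each pair is sorted internally. That is why Out[1] in the question shows '-9.01' before '90.12' even though the original tuple was ('90.12', '-9.01'). The order of the rows themselves just follows the iteration order of the set.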


I believe you can get your desired result by packing your two lists into a structured array. I don't know whether numpy.unique takes structured arrays, but if not, you can replicate its behavior by using numpy.sort, which documents how to use it with structured arrays.
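A minimal sketch of the structured-array idea, assuming a NumPy version whose unique accepts structured dtypes (the asker's comment below suggests it does). The field names 'a' and 'b', the 'U5' string dtype, and the `groups` variable are illustrative choices, not anything from the question; for the real float data a dtype such as 'f8' would presumably be used instead.

```python
import numpy as np

l_1 = ['12.34', '12.34', '12.34', '12.34', '56.78', '56.78', '90.12', '90.12']
l_2 = ['-1.23', '-1.23', '-4.56', '-4.56', '-6.78', '-6.78', '-9.01', '-9.01']

# Pack the pairs into a 1-D structured array: each element is one (a, b) record,
# so flattening inside np.unique cannot split a pair apart.
pairs = np.array(list(zip(l_1, l_2)), dtype=[('a', 'U5'), ('b', 'U5')])

# Records are compared field by field, so each pair counts as a single item.
unique_pairs, inverse = np.unique(pairs, return_inverse=True)

# inverse[i] is the position in unique_pairs of original pair i;
# collect the original indices that map to each unique pair.
groups = [np.flatnonzero(inverse == k) for k in range(len(unique_pairs))]

for pair, idx in zip(unique_pairs, groups):
    print(pair, idx.tolist())
```

The groups come out in the sorted order of the unique pairs, here [0, 1], [2, 3], [4, 5], [6, 7], which contains the same index sets the question asks for, just in a different order.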

user2357112
  • Excellent! I hadn't even thought of structured arrays, but I just gave it a try and it does exactly what I want. Thanks for the clarification on the numpy docs, I also had been confused by the contradiction of `This will be flattened if it is not already 1-D.` – Ajean Sep 05 '14 at 19:44
  • That buggy behavior was fixed a while back, see the master source code [here](https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L96) which is also what went into the 1.9 beta. – Jaime Sep 05 '14 at 20:22
  • @Jaime: Ah, good to know. I didn't think to check the development version of the code. – user2357112 Sep 05 '14 at 23:22