Stable conversion of a multi-column (2D) numpy array to an indicator vector

Question

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.

For example, I have the following numpy array:

import numpy as np
arr = np.array([
   [2, 20, 1],
   [1, 10, 3],
   [2, 20, 2],
   [2, 20, 1],
   [1, 20, 3],
   [2, 20, 2],
])

The output I like to have is:

indicator = [0, 1, 2, 0, 3, 2]

How can I do this (preferably using numpy only)?

Notes:

I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear expert's opinion.

My thoughts so far:

1. Numpy's unique: This would not work, as it is not stable:

   arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
   print(arr_unq)
   # output 1:
   # [[ 1 10  3]
   #  [ 1 20  3]
   #  [ 2 20  1]
   #  [ 2 20  2]] 

   print(indicator)
   # output 2:
   # [2 0 3 2 1 3]

Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.

Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program. Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:

dct = {}
for idx, item in enumerate(indicator):
   if item not in dct:
       dct[item] = len(dct)
   indicator[idx] = dct[item]
    
print(indicator)
# outputs:
# [0 1 2 0 3 2]

2. Stabilizing numpy's unique output: This solution is already posted in stackoverflow and correctly returns an stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in an stable order starting from 0.

3. Pandas's get_dummies: function. But it returns a "hot encoding" (matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the "hot encoding" to the indicator vector by few lines of code and data manipulation. But again that approach is not going to be highly efficient.

@MadPhysicist yes. It would be nice to have `arr_unq`, but that is not a priority. — Amin.A, Jun 22 '21 at 11:47

score 2 · Accepted Answer · edited Jun 22 '21 at 16:48

In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)

Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() places unq into sorted order. The corrected result is therefore

indicator = idx.argsort().argsort()[inv]

And of course the byproduct

unq = unq[idx.argsort()]

Of course there's nothing special about these operations to 2D.

A Note on the Intuition

Let's say you have an array x:

x = np.array([7, 3, 0, 1, 4])

x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So

i = x.argsort() # 2, 3, 1, 4, 0

But how would you get from np.sort(x) back to x (which is the problem you express in #2)?

Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as

np.sort(x)[i.argsort()]

Which is equivalent to

x[i][i.argsort()]

OR

x[x.argsort()][x.argsort().argsort()]

So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.

Thanks for the edit @MadPhysicist, it is now working very nicely and as expected! Meanwhile, I have to admit I still don't really get the intuition behind how it works (the .argsort().argsort() section in particular). So if possible, please add some more notes to make the intuition more clear. Hopefully somebody else uses this too :) — Amin.A, Jun 22 '21 at 13:58
Brilliant @Mad Physicist, really made my day. Greatly appreciating the help! I never knew about the .argsort().argsort(). Thanks for sharing :) — Amin.A, Jun 22 '21 at 16:28
@Amin.A. I got pretty familiar with it at some point, but it still blows my mind every time. I always have to work it out from scratch :) — Mad Physicist, Jun 22 '21 at 16:49

Stable conversion of a multi-column (2D) numpy array to an indicator vector

1 Answers1