Say I have these two arrays:

import numpy as np
dictionary = np.array(['a', 'b', 'c'])
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])

I'd like to replace every element of array with the index of its value in dictionary. At the moment I do it with a loop:

for index, value in enumerate(dictionary):
    array[array == value] = index
array = array.astype(int)

To get:

array([[0, 0, 2],
       [1, 1, 2]])

Is there a vectorized way to do this? I know that if array already contained indices and I wanted the strings in dictionary, I could just do dictionary[array]. But I effectively need a "lookup" of strings here.

(I also saw this answer, but I'm wondering whether anything newer has become available since 2010.)

capitalistcuttle

1 Answer

If your dictionary is sorted, and dictionary and array contain the same elements, np.unique does the trick:

uniq, inv = np.unique(array, return_inverse=True)
result = inv.reshape(array.shape)

If some elements are missing in array:

uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
result = inv[len(dictionary):].reshape(array.shape)
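To see why prepending dictionary matters, here is a small sketch with made-up data (the array below is my own example, not the one from the question) in which 'b' never occurs in array. Calling np.unique on array alone numbers only the values that are present, so 'c' ends up with index 1 instead of 2:

```python
import numpy as np

# Hypothetical data for illustration: sorted dictionary, but 'b' is absent from array.
dictionary = np.array(['a', 'b', 'c'])
array = np.array([['a', 'a', 'c'], ['c', 'a', 'c']])

# np.unique on array alone only sees 'a' and 'c', so 'c' is numbered 1.
_, naive_inv = np.unique(array, return_inverse=True)
naive = naive_inv.reshape(array.shape)

# Prepending dictionary guarantees that every dictionary entry appears in uniq,
# so positions in uniq coincide with positions in the (sorted) dictionary.
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
result = inv[len(dictionary):].reshape(array.shape)

print(naive)   # 'c' is mapped to 1 here
print(result)  # 'c' is mapped to 2, its index in dictionary
```

The first len(dictionary) entries of inv belong to the prepended dictionary itself, which is why they are sliced off before reshaping.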

General case:

uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
back = np.empty_like(inv[:len(dictionary)])
back[inv[:len(dictionary)]] = np.arange(len(dictionary))
result = back[inv[len(dictionary):]].reshape(array.shape)

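Explanation: np.unique, in the form we are using it here, returns the unique elements in sorted order (uniq) and, for each element of its argument, that element's index into this sorted list (inv). So to get indices into the original dictionary we need to remap the indices. Because dictionary occupies the first len(dictionary) positions of the argument, we know that uniq[inv[:len(dictionary)]] == dictionary. Therefore we must find back such that back[inv[:len(dictionary)]] == np.arange(len(dictionary)), which is what the assignment to back does. This assumes that the entries of dictionary are unique and that every element of array occurs in dictionary.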
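As a worked illustration of the remapping, here is the general-case code run on a deliberately unsorted dictionary. The data and the shorthand n are my own choices for the example, not part of the question; treat it as a sketch under the assumptions that the dictionary entries are unique and that every element of array occurs in dictionary.

```python
import numpy as np

# Hypothetical unsorted dictionary; array is the one from the question.
dictionary = np.array(['c', 'a', 'b'])
array = np.array([['a', 'a', 'c'], ['b', 'b', 'c']])

n = len(dictionary)
uniq, inv = np.unique(np.r_[dictionary, array.ravel()], return_inverse=True)
# uniq is sorted:      ['a', 'b', 'c']
# inv[:n] is [2, 0, 1]: where each dictionary entry sits in uniq.

# Invert that permutation: back[position in uniq] = index in dictionary.
back = np.empty_like(inv[:n])
back[inv[:n]] = np.arange(n)
# back is [1, 2, 0]: 'a' -> 1, 'b' -> 2, 'c' -> 0.

# inv[n:] holds positions in uniq for the array elements; back turns them
# into indices into the original dictionary.
result = back[inv[n:]].reshape(array.shape)
print(result)
# [[1 1 0]
#  [2 2 0]]
```

A quick sanity check is that dictionary[result] reproduces array, since indexing dictionary with the result is the inverse of the lookup.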

Paul Panzer
  • Very neat! Thank you. Will wait just a bit to see if anyone has a general solution for non-sorted dictionaries. (You sort of circumvent the need to look up the dictionary altogether.) – capitalistcuttle Feb 23 '17 at 03:27
  • @capitalistcuttle The last bit ("General case") is for non-sorted dictionaries. – Paul Panzer Feb 23 '17 at 03:33
  • Ah, right. Hadn't gotten my head around that yet :). And I see it avoids sorting the dictionary and uses the unique()'s sorted output. Great! BTW, there's an extra right square bracket in 2nd line of 2nd bit. And a couple of lines describing the indirection of `back` would be helpful. – capitalistcuttle Feb 23 '17 at 04:43