Convert to indices array in numpy

Question

Similar to retrun_inverse in numpy.unique,

if I have a numpy array a:[['a' 'b'] ['b' 'c'] ['c' 'c'] ['c' 'b']],

I want to convert array b:[['b' 'c'] ['a' 'b'] ['c' 'c'] ['a' 'b'] ['c' 'c']] to [1 0 2 0 2].

Is there any smart way to convert it?

norok2 · Accepted Answer · 2020-05-22T18:46:09.783

Perhaps this is more easily done with plain lists (which you can get from NumPy arrays with .tolist() method):

a = [['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']]
b = [['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']]

print([a.index(x) for x in b])
# [1, 0, 2, 0, 2]

or, writing it as a function and assuming NumPy arrays inputs and outputs and handling the case where a needle is not in the haystack:

import numpy as np


def find_by_list(haystack, needles):
    haystack = haystack.tolist()
    result = []
    for needle in needles.tolist():
        try:
            result.append(haystack.index(needle))
        except ValueError:
            result.append(-1)
    return np.array(result)

This comes out approximately as fast as a more NumPy-aware solutions based on np.where() (provided that the reduction operated by np.all() can be done on the first axis), e.g.:

import numpy as np


def find_by_np(haystack, needles, haystack_axis=-1, needles_axis=-1, keepdims=False):
    if haystack_axis:
        haystack = haystack.swapaxes(0, haystack_axis)
    if needles_axis:
        needles = needles.swapaxes(0, needles_axis)
    n = needles.shape[0]
    m = haystack.ndim - 1
    shape = haystack.shape[1:]
    result = np.full((m,) + needles.shape[1:], -1)
    haystack = haystack.reshape(n, -1)
    needles = needles.reshape(n, -1)
    _, match, index = np.nonzero(np.all(
        haystack[:, None, :] == needles[:, :, None],
        axis=0, keepdims=True))
    result.reshape(m, -1)[:, match] = np.unravel_index(index, shape)
    if not keepdims and result.shape[0] == 1:
        result = np.squeeze(result, 0)
    return result

but both are slower than simple loops accelerated with Numba JIT, e.g.:

import numpy as np
import numba as nb


def find_by_loop(haystack, needles):
    n, m = haystack.shape
    l, m_ = needles.shape
    result = np.full(l, -1)
    if m != m_:
        return result
    for i in range(l):
        for j in range(n):
            is_equal = True
            for k in range(m):
                if haystack[j, k] != needles[i, k]:
                    is_equal = False
                    break
            if is_equal:
                break
        if is_equal:
            result[i] = j
    return result


find_by_nb = nb.jit(find_by_loop)
find_by_nb.__name__ = 'find_by_nb'

While they all give the same result:

funcs = find_by_list, find_by_np, find_by_loop, find_by_nb


a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
print(a.shape, b.shape)
for func in funcs:
    print(f'{func.__name__:>12s}(a, b) = {func(a, b)}')
# find_by_list(a, b) = [1 0 2 0 2]
#   find_by_np(a, b) = [1 0 2 0 2]
# find_by_loop(a, b) = [1 0 2 0 2]
#   find_by_nb(a, b) = [1 0 2 0 2]

The timings comes as follows:

print(f'({"n":<4s}, {"m":<4s}, {"k":<4s})', end='  ')
for func in funcs:
    print(f'{func.__name__:>15s}', end='    ')
print()
for n, m, k in itertools.product((5, 50, 500), repeat=3):
    a = np.random.randint(0, 100, (k, n))
    b = np.random.randint(0, 100, (m, n))
    print(f'({n:<4d}, {m:<4d}, {k:<4d})', end='  ')
    for func in funcs:
        result = %timeit -n3 -r10 -q -o func(a, b)
        print(f'{result.best * 1e3:12.3f} ms', end='    ')
    print()
# (n   , m   , k   )     find_by_list         find_by_np       find_by_loop         find_by_nb    
# (5   , 5   , 5   )         0.008 ms           0.048 ms           0.021 ms           0.001 ms    
# (5   , 5   , 50  )         0.018 ms           0.031 ms           0.176 ms           0.001 ms    
# (5   , 5   , 500 )         0.132 ms           0.092 ms           1.754 ms           0.006 ms    
# (5   , 50  , 5   )         0.065 ms           0.031 ms           0.184 ms           0.001 ms    
# (5   , 50  , 50  )         0.139 ms           0.093 ms           1.756 ms           0.006 ms    
# (5   , 50  , 500 )         1.096 ms           0.684 ms          17.546 ms           0.049 ms    
# (5   , 500 , 5   )         0.658 ms           0.093 ms           1.871 ms           0.006 ms    
# (5   , 500 , 50  )         1.383 ms           0.699 ms          17.504 ms           0.051 ms    
# (5   , 500 , 500 )         9.102 ms           7.752 ms         177.754 ms           0.491 ms    
# (50  , 5   , 5   )         0.026 ms           0.061 ms           0.022 ms           0.001 ms    
# (50  , 5   , 50  )         0.054 ms           0.042 ms           0.174 ms           0.002 ms    
# (50  , 5   , 500 )         0.356 ms           0.203 ms           1.759 ms           0.006 ms    
# (50  , 50  , 5   )         0.232 ms           0.042 ms           0.185 ms           0.001 ms    
# (50  , 50  , 50  )         0.331 ms           0.205 ms           1.744 ms           0.006 ms    
# (50  , 50  , 500 )         1.332 ms           2.422 ms          17.492 ms           0.051 ms    
# (50  , 500 , 5   )         2.328 ms           0.197 ms           1.882 ms           0.006 ms    
# (50  , 500 , 50  )         3.092 ms           2.405 ms          17.618 ms           0.052 ms    
# (50  , 500 , 500 )        11.088 ms          18.989 ms         175.568 ms           0.479 ms    
# (500 , 5   , 5   )         0.205 ms           0.035 ms           0.023 ms           0.001 ms    
# (500 , 5   , 50  )         0.410 ms           0.137 ms           0.187 ms           0.001 ms    
# (500 , 5   , 500 )         2.800 ms           1.914 ms           1.894 ms           0.006 ms    
# (500 , 50  , 5   )         1.868 ms           0.138 ms           0.201 ms           0.001 ms    
# (500 , 50  , 50  )         2.154 ms           1.814 ms           1.902 ms           0.006 ms    
# (500 , 50  , 500 )         6.352 ms          16.343 ms          19.108 ms           0.050 ms    
# (500 , 500 , 5   )        19.798 ms           1.957 ms           2.020 ms           0.006 ms    
# (500 , 500 , 50  )        20.922 ms          13.571 ms          18.850 ms           0.052 ms    
# (500 , 500 , 500 )        35.947 ms         139.923 ms         189.747 ms           0.481 ms

Showing that Numba provides the fastest (and most memory efficient) solution, while its non-JIT accelerated version provides the slowest. The NumPy-based one and the list-based one come out somewhere in between, with varying speed. But for larger inputs, the list based one should be faster on average because it provides better short-circuiting.

score 0 · Answer 2 · answered May 21 '20 at 10:43

Not the most elegant solution, but it works:

Setting up (in the future, show the code to generate your example, it'll make it faster to answer):

import numpy as np
a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
desired_output = [1, 0, 2, 0, 2]

Using the numpy.where function (as in this related question: Is there a NumPy function to return the first index of something in an array?)

We use np.where per item in each row, multiply the boolean results, and use list comprehension to pass row by row:

output = [np.where((x[0]==a[:,0]) * (x[1]==a[:,1]))[0][0] for x in b]

It gives back your desired result.

score 0 · Answer 3 · answered May 21 '20 at 13:37

An interesting way to do things, perhaps?

a.append(None)
aa = np.array(a)[:-1]                # Note 1

b.append(None)
bb = np.array(b)[:-1]

ind_arr = bb[:, None] == aa          # Note 2
np.nonzero(ind_arr)[1]

Note 1: The first step is more like an overhead to get an object type 1-D array. Otherwise, numpy forces a 2-D str type array, which is unhelpful for this application. Read more about it in this answer. It also states a few alternatives.

Note 2: This creates a 2-D boolean mask with each element of aa compared with each element of bb for equality, like so: ind_arr[i, j] = (bb[i] == aa[j]).
The next line uses this mask and extracts the True values along axis 1 (where the comparison had evaluated to True). This is because aa values in the comparison mask are along axis 1.
Another discussion to understand this better.

If you're looking for speed though, for lists, norok2's answer is quite a lot faster. This perhaps, can have innovative applications. Cheers!

Convert to indices array in numpy

3 Answers3