
I have a numpy array A which contains unique IDs that can be in any order - e.g. A = [1, 3, 2]. I have a second numpy array B, which is a record of when the ID is used - e.g. B = [3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1]. Array B is always much longer than array A.

I need to find, for each time an ID is used in B, the index of that ID in A. So in the example above my returned result would be: result = [1, 1, 0, 1, 2, 0, 2, 1, 0, 0, 2, 1, 1, 0].

I've already written a simple solution that gets the correct result using a for loop to append the result to a new list and using numpy.where, but I can't figure out the correct syntax to vectorize this.

import numpy as np
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])

IdIndxs = []
for ID in B:
    IdIndxs.append(np.where(A == ID)[0][0])

IdIndxs = np.array(IdIndxs)

Can someone come up with a simple vector-based solution that runs quickly? The for loop becomes very slow on a typical problem, where A has 10K-100K elements and B is some multiple of that, usually 5-10x larger than A.

I'm sure the solution is simple, but I just can't see it today.

jpmorr
  • If your values were sorted, you could have used `digitize` or `searchsorted`. But, alas! – Sheldore Jun 02 '19 at 14:43
  • @Sheldore I've been trying to use some version of sorting and np.where(A==B) but can't seem to get the right answer. Is it a simple solution? My brain is a little foggy today. – jpmorr Jun 02 '19 at 14:50

3 Answers


The numpy-indexed library (disclaimer: I am its author) was designed to provide these types of vectorized operations that numpy, for some reason, does not. Frankly, given how commonly useful this vectorized `list.index` equivalent is, it ought to be in numpy itself; but numpy is a slow-moving project that takes backwards compatibility very seriously, and I don't think we will see this until numpy 2.0. Until then, this library is pip- and conda-installable with the same ease.

import numpy_indexed as npi

# A and B as defined in the question
idx = npi.indices(A, B)
Eelco Hoogendoorn
  • Thanks for pointing me to this library. I'm surprised there isn't a simple numpy operation for this and will look into utilising this library. Can it be solved without using a 3rd party package - I'm not sure I'll have access to edit installed packages on work machines. – jpmorr Jun 02 '19 at 14:45
  • Feel free to copy the source code over; though that would be far from the most compact possible solution; this implementation is geared much more towards generality. Though it may serve as inspiration for a compact solution too. – Eelco Hoogendoorn Jun 02 '19 at 14:57

You can use this:

import numpy as np

# test data
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])

# get indexes
sorted_keys = np.argsort(A)
indexes = sorted_keys[np.searchsorted(A, B, sorter=sorted_keys)]

Output:

[1 1 0 1 2 0 2 1 0 0 2 1 1 0]
Zaraki Kenpachi
  • `sorter=sorted_keys` was what I was missing – Sheldore Jun 02 '19 at 14:57
  • Do you need `searchsorted`? You could do: `indexes=np.argsort(A)[B-1]` – Brenlla Jun 02 '19 at 15:05
  • @Zaraki Thanks for this. Seems to work as I hoped. Although, just testing on my data, it fails since it seems that my data is A = np.array([[1], [3], [2]]), but I should be able to resolve this. – jpmorr Jun 02 '19 at 15:09
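Regarding the last comment: the searchsorted approach assumes A is 1-D. If A arrives as a column vector, a minimal sketch of the fix (the shapes here are assumptions based on the comment) is to flatten it with `ravel` first:

```python
import numpy as np

# Column-vector form of A, as described in the comment above
A = np.array([[1], [3], [2]])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])

flat_A = A.ravel()                # view A as 1-D: [1, 3, 2]
sorted_keys = np.argsort(flat_A)  # order that sorts the IDs
indexes = sorted_keys[np.searchsorted(flat_A, B, sorter=sorted_keys)]

print(indexes)  # [1 1 0 1 2 0 2 1 0 0 2 1 1 0]
```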

Reworking your logic, but using a list comprehension and `numpy.fromiter`, which should boost performance:

IdIndxs = np.fromiter([np.where(A == i)[0][0] for i in B], B.dtype)

About performance

I've done a quick test comparing `fromiter` with your solution, and I do not see such a boost in performance. Even using a B array of millions of elements, the two are of the same order.
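For reference, a minimal timing sketch of this kind of comparison (the array sizes and seed are arbitrary assumptions; absolute numbers will vary by machine):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.permutation(1000)      # unique IDs in arbitrary order
B = rng.choice(A, size=5000)   # usage record, 5x larger than A

def loop_fromiter():
    # per-element np.where lookup, collected with fromiter
    return np.fromiter((np.where(A == i)[0][0] for i in B), B.dtype)

def vectorized():
    # searchsorted-based lookup from the earlier answer
    keys = np.argsort(A)
    return keys[np.searchsorted(A, B, sorter=keys)]

assert np.array_equal(loop_fromiter(), vectorized())
print("fromiter:    ", timeit.timeit(loop_fromiter, number=3))
print("searchsorted:", timeit.timeit(vectorized, number=3))
```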

Valentino