Question description
Let's say we have two simple arrays:
import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
I want to find the indices of all matching values between query and match. In this case the result would be:
value  query  match
 100     0      1
 100     0      3
 100     5      1
 100     5      3
4000     1      2
In reality these arrays will contain millions of items.
"Stupid" loop solution
qs = []
query_locs = []
match_locs = []
for i in range(query.size):
    q = query[i]
    # Get the indices in `match` equal to q
    match_loc = np.where(match == q)[0]
    n = match_loc.size
    # Update the location arrays
    match_locs.extend(match_loc)
    query_locs.extend(np.repeat(i, n))
    # Store the matching value once per hit
    qs.extend(np.repeat(q, n))
result = np.vstack((qs, query_locs, match_locs)).T
print(result)
[[ 100    0    1]
 [ 100    0    3]
 [4000    1    2]
 [ 100    5    1]
 [ 100    5    3]]
(Maybe numba could make this loop pretty fast; however, when I tried it I got some errors about the signatures, so I'm not sure about that.)
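For reference, the loop above can also be written as a single broadcasted comparison. This is only a sketch: it builds a len(query) × len(match) boolean matrix, so memory grows as O(n·m) and it will not scale to millions of items without chunking.

```python
import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

# Compare every query element against every match element at once.
# The intermediate boolean matrix is len(query) x len(match), so this
# trades memory for speed.
query_locs, match_locs = np.nonzero(query[:, None] == match[None, :])
result = np.column_stack((query[query_locs], query_locs, match_locs))
print(result)
```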
NumPy built-ins
There are quite a few built-in NumPy functions for this kind of problem with unique values, such as searchsorted and intersect1d. However, as also described in the docs, they "Return the sorted, unique values" and thus do not take duplicates into account. Some examples on Stack Overflow for this problem with unique values:
- NumPy: Comparing Elements in Two Arrays
- Efficient way to compute intersecting values between two numpy arrays
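To illustrate the limitation: intersect1d can return indices via return_indices=True, but only for the first occurrence of each unique common value, so the duplicate 100s are dropped.

```python
import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

# return_indices=True gives the index of the *first* occurrence of each
# unique common value - the duplicates at query[5] and match[3] are lost.
values, query_idx, match_idx = np.intersect1d(query, match, return_indices=True)
print(values)     # [ 100 4000]
print(query_idx)  # [0 1]
print(match_idx)  # [1 2]
```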
I could imagine there is a faster way to do this with NumPy instead of a loop, so I'm curious to see an answer!
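One possible direction (a sketch, not benchmarked): sort match once, then use searchsorted with side='left' and side='right' to find the span of equal values in the sorted copy for every query element. Apart from the span-expanding comprehension, everything is vectorised, and memory use stays O(n + m).

```python
import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

# Sort `match` once and remember the original positions.
order = np.argsort(match, kind="stable")
sorted_match = match[order]

# For every query element, find the half-open span [left, right) of
# equal values in the sorted copy.
left = np.searchsorted(sorted_match, query, side="left")
right = np.searchsorted(sorted_match, query, side="right")
counts = right - left  # number of matches per query element

# Expand the spans into flat index arrays.
query_locs = np.repeat(np.arange(query.size), counts)
flat = np.concatenate([np.arange(l, r) for l, r in zip(left, right)])
match_locs = order[flat]  # map back to positions in the original `match`

result = np.column_stack((query[query_locs], query_locs, match_locs))
print(result)
```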