8

I have two sets of coordinates and want to find out which coordinates of the coo set are identical to any coordinate in the targets set. I want to know the indices in the coo set which means I'd like to get a list of indices or of bools.

import numpy as np

coo = np.array([[1,2],[1,6],[5,3],[3,6]]) # coordinates
targets = np.array([[5,3],[1,6]]) # coordinates of targets

print(np.isin(coo,targets))

[[ True False]
 [ True  True]
 [ True  True]
 [ True  True]]

The desired result would be one of the following two:

[False True True False] # bool list
[1,2] # list of concerning indices

My problem is, that ...

  • np.isin has no axis-attribute so that I could use axis=1.
  • even applying logical and to each row of the output would return True for the last element, which is wrong.

I am aware of loops and conditions but I am sure Python is equipped with ways for a more elegant solution.

fountainhead
  • 3,584
  • 1
  • 8
  • 17
Hein Schnell
  • 322
  • 5
  • 15

3 Answers3

14

This solution will scale worse for large arrays, for such cases the other proposed answers will perform better.


Here's one way taking advantage of broadcasting:

(coo[:,None] == targets).all(2).any(1)
# array([False,  True,  True, False])

Details

Check for every row in coo whether or not it matches another in target by direct comparisson having added a first axis to coo so it becomes broadcastable against targets:

(coo[:,None] == targets)

array([[[False, False],
        [ True, False]],

       [[False, False],
        [ True,  True]],

       [[ True,  True],
        [False, False]],

       [[False, False],
        [False,  True]]])

Then check which ndarrays along the second axis have all values to True:

(coo[:,None] == targets).all(2)

array([[False, False],
       [False,  True],
       [ True, False],
       [False, False]])

And finally use any to check which rows have at least one True.

yatu
  • 86,083
  • 12
  • 84
  • 139
2

Here is a simple and intuitive solution that actually uses numpy.isin(), to match tuples, rather than match individual numbers:

# View as a 1d array of tuples
coo_view     = coo.view(dtype='i,i').reshape((-1,))
targets_view = targets.view(dtype='i,i').reshape((-1,))

result = np.isin(coo_view, targets_view)
print (result)
print(result.nonzero()[0])

Output:

[False  True  True False]
[1 2]

Notes:

  1. The creation of these views does not involve any copying of data.
  2. The dtype='i,i' specifies that we want each element of the view to be a tuple of two integers
fountainhead
  • 3,584
  • 1
  • 8
  • 17
2

The numpy_indexed package implements functionality of this type in a vectorized manner (disclaimer: I am its author). Sadly numpy lacks a lot of this functionality out of the box; I started numpy_indexed with the intention of having it merged into numpy, but there are some backwards compatibility concerns, and big packages like that tend to move slowly. So that hasnt happened in the last 3 years; but the python packaging ecosystem works so well nowadays that just adding one more package to your environment is just as simple, really.

import numpy_indexed as npi
bools = npi.in_(targets, coo)

This will have a time-complexity similar to that of the solution posted by @fountainhead (logarithmic rather than linear, as per the currently accepted answer), but also the npi library will give you the safety of automated tests, and a lot of other convenient options, should you decide to approach the problem from a slightly different angle.

Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
  • I'm not sure about the currently accepted answer having a linear time-complexity. When we start with two arrays of size `2m` numbers and `2n` numbers respectively, the broadcasted size will be `2mn`. So, the `==` comparison of the broadcasted arrays will always perform `2mn` numeric comparison. It will be **always** `2mn` comparisons, and **not just in the worst case**. (The worst case is when there are zero matches). In contrast, `isin`-based solutions stop further matches when each successful match occurs. – fountainhead Feb 24 '19 at 11:11
  • It's linear in both n and m; so quadratic if you scale both at the same time. That is indeed as brute-force as it gets, and smarter solutions can have a huge advantage in practice – Eelco Hoogendoorn Feb 24 '19 at 11:14
  • shouldn't it be the other way around? npi.in_(coo, targets)? I tried your class and it only worked this way and the documentation also says so. Cheers. – Sören Jul 21 '19 at 18:50