2

I want to identify the indices in my numpy array whose value is one of the values contained in a set; for instance, this set my be (5,6,7,8).

Right now I am doing

np.where(np.isin(arr, [5,6,7,8]))

which works fine. I was wondering if there is a better way to achieve this functionality.

cs95
  • 379,657
  • 97
  • 704
  • 746
samuelli97
  • 61
  • 1
  • 8

3 Answers3

4

The code you have is correct and reasonable. You should keep it.

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • This doesn't really answer the question, because the best solution depends on the context of its usage. – cs95 Jun 09 '18 at 01:34
3

You can't know whether your current solution is good or not if you don't know what the alternatives are.

First,

np.where(np.isin(arr, val))

Works for any general case. np.isin does a linear search over arr for elements in val.

You can also substitute np.where with np.nonzero, which is a bit faster for larger N.

Next, there is

(arr[:, None] == val).argmax(0)

Which is very fast for small sizes of arr and val (N < 100).

Finally, if arr is sorted, I recommend np.searchsorted.

np.searchsorted(arr, val)

arr = np.arange(100000)
val = np.random.choice(arr, 1000)

%timeit np.where(np.isin(arr, val))
%timeit np.nonzero(np.isin(arr, val))
%timeit (arr[:, None] == val).argmax(0)
%timeit np.searchsorted(arr, val)

8.3 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.88 ms ± 791 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
861 ms ± 6.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
235 µs ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The problem with (arr[:, None] == val).argmax(0) is the memory blowout - the comparison is broadcasted, introducing a very, very sparse matrix that is wasteful when N is large (so don't use it for large N).

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    You show that `searchsorted()` is fast, but it produces an entirely different output. – John Zwinck Jun 09 '18 at 02:12
  • @JohnZwinck sure, but that doesn't discount the rest of my answer. – cs95 Jun 09 '18 at 03:29
  • 1
    `sorter` arg could be used with `searchsorted` for non-sorted cases. – Divakar Jun 09 '18 at 07:23
  • Super cool, @Divakar! Is that a new feature? Or has it been around for a while? Can't believe I've used this function so much without knowing about it. – cs95 Jun 09 '18 at 07:59
  • 1
    [Nope](https://stackoverflow.com/questions/45830838/determine-the-shuffled-indices-of-two-lists-arrays) :) – Divakar Jun 09 '18 at 08:04
2

Your approach is valid and It also works for multi dimensional arrays.

x = np.arange(9.).reshape(3, 3)
>>> goodvalues = [3, 4, 7]
>>> ix = np.isin(x, goodvalues)
>>> ix
array([[False, False, False],
      [ True,  True, False],
      [False,  True, False]], dtype=bool)
>>> np.where(ix)
(array([1, 1, 2]), array([0, 1, 1]))

This is straight out of the documentation here: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html#numpy.where

iankit
  • 8,806
  • 10
  • 50
  • 56