25

This problem seems easy but I cannot quite get a nice-looking solution. I have two numpy arrays (A and B), and I want to get the indices of A where the elements of A are in B and also get the indices of A where the elements are not in B.

So, if

A = np.array([1,2,3,4,5,6,7])
B = np.array([2,4,6])

Currently I am using

C = np.searchsorted(A,B)

which takes advantage of the fact that A is in order, and gives me [1, 3, 5], the indices of the elements of A that are in B. This is great, but how do I get D = [0,2,4,6], the indices of elements of A that are not in B?

askewchan
DanHickstein

5 Answers

40

searchsorted may give you the wrong answer if not every element of B is in A. You can use numpy.in1d instead (numpy.isin is the equivalent in NumPy 1.13 and later):

A = np.array([1,2,3,4,5,6,7])
B = np.array([2,4,6,8])
mask = np.in1d(A, B)
print(np.where(mask)[0])   # indices of A whose elements are in B
print(np.where(~mask)[0])  # indices of A whose elements are not in B

output is:

[1 3 5]
[0 2 4 6]
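
For contrast, here is a minimal sketch of what happens when searchsorted is applied to the same A and B. It assumes A is sorted; the names idx, valid and found are only illustrative. searchsorted returns insertion points rather than membership tests, so the result has to be checked against A before it can be trusted:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7])
B = np.array([2, 4, 6, 8])

# Insertion points that keep A sorted; 8 is not in A, so it maps to len(A)
idx = np.searchsorted(A, B)

# Keep only positions that are inside A and actually hold the searched value
valid = idx < len(A)
valid[valid] = A[idx[valid]] == B[valid]
found = idx[valid]

print(idx)    # includes the out-of-range insertion point for 8
print(found)  # indices of A whose elements are in B
```

The mask returned by in1d avoids this extra bookkeeping.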

However, in1d() is sort-based, which is slow for large datasets. If your dataset is large, you can use pandas instead:

import pandas as pd
np.where(pd.Index(pd.unique(B)).get_indexer(A) >= 0)[0]

Here is the time comparison:

A = np.random.randint(0, 1000, 10000)
B = np.random.randint(0, 1000, 10000)

%timeit np.where(np.in1d(A, B))[0]
%timeit np.where(pd.Index(pd.unique(B)).get_indexer(A) >= 0)[0]

output:

100 loops, best of 3: 2.09 ms per loop
1000 loops, best of 3: 594 µs per loop
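
The pandas version relies on a hash lookup rather than a sort. As a rough illustration of that idea only (this is not the pandas implementation, and it has not been timed here), the same membership test can be sketched with a plain Python set. The names b_set, in_B and not_in_B are assumptions made for this example, and the Python-level loop will likely be slower than either method above:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7])
B = np.array([2, 4, 6, 8])

# Hash-based membership: build a set from B once, then test each element of A
b_set = set(B.tolist())
mask = np.fromiter((x in b_set for x in A.tolist()), dtype=bool, count=len(A))

in_B = np.flatnonzero(mask)       # indices of A whose elements are in B
not_in_B = np.flatnonzero(~mask)  # indices of A whose elements are not in B
```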
ford
HYRY
  • 2
    It's good to know about this efficient method because my datasets are very large. Thanks so much for this solution! – DanHickstein Apr 11 '13 at 22:05
8
import numpy as np

A = np.array([1,2,3,4,5,6,7])
B = np.array([2,4,6])
C = np.searchsorted(A, B)

# all indices of A, minus the indices already found in C
D = np.delete(np.arange(len(A)), C)

D
#array([0, 2, 4, 6])
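
An equivalent way to get from C to D is a boolean mask, which may be handy if the mask itself is needed later. This is a sketch under the same assumptions as above: A is sorted and every element of B occurs in A, so C holds valid indices. The name keep is only illustrative.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7])
B = np.array([2, 4, 6])
C = np.searchsorted(A, B)

# Start with every index marked, then unmark the ones found in C
keep = np.ones(len(A), dtype=bool)
keep[C] = False

D = np.flatnonzero(keep)  # indices of A whose elements are not in B
```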
askewchan
  • 1
    Thanks! I also like the answer provided by alexhb using np.setdiff1d. I was hoping that there was a function that would give me the indices directly, but this works just fine. – DanHickstein Apr 11 '13 at 02:54
  • There might be, @Dan, but I can't think of it. If you don't need `C`, use his solution, but mine will be twice as fast if you've already got `C`. – askewchan Apr 11 '13 at 02:55
7
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([2, 4, 6])
c = np.searchsorted(a, b)
d = np.searchsorted(a, np.setdiff1d(a, b))

d
#array([0, 2, 4, 6])
alexhb
  • Having to search twice slows this down a bit, better to use the already known `C` to get `D`. But, this is of course the better solution if `C` is not needed, so +1. (Welcome to [SO]!) – askewchan Apr 11 '13 at 02:53
  • should the `c` line be deleted? it is not doing anything here – crypdick Mar 20 '23 at 16:27
5

The elements of A that are also in B:

set(A) & set(B)

The elements of A that are not in B:

set(A) - set(B)
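
If indices rather than elements are wanted, NumPy's set routines can provide them as well. A minimal sketch, assuming A contains no duplicates (with return_indices=True, intersect1d reports the first occurrence of each common value); the variable names are only illustrative:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7])
B = np.array([2, 4, 6])

# Common values plus their positions in A and in B
common, idx_in_A, idx_in_B = np.intersect1d(A, B, return_indices=True)

# Every index of A that is not among the matched positions
idx_not_in_B = np.setdiff1d(np.arange(len(A)), idx_in_A)
```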

Community
Ben Zweig
  • This does not answer the question (to get indexes, not elements). However, if you want to perform above operation for numpy, do not convert it to set, but use numpy operations instead. See [intersect1d](https://numpy.org/doc/stable/reference/generated/numpy.intersect1d.html?highlight=intersect1d#numpy.intersect1d) and [setdiff1d](https://numpy.org/doc/stable/reference/generated/numpy.setdiff1d.html) (or eventually [setxor1d](https://numpy.org/doc/stable/reference/generated/numpy.setxor1d.html#numpy.setxor1d)). – Nerxis Aug 18 '20 at 13:51
  • Thank you, as I was looking for elements not indices and the question title is ambiguous. I appreciate the numpy operations as well. – PhasorLaser Feb 18 '22 at 22:15
0
import numpy as np

all_vals = np.arange(1000)  # `A` in the question
seen_vals = np.unique(np.random.randint(0, 1000, 100))  # `B` in the question
# boolean mask: True where a value of `all_vals` is not in `seen_vals`
mask = np.isin(all_vals, seen_vals, invert=True)
unseen_idx = np.flatnonzero(mask)  # indices of unseen values, `D` in the original question
unseen_vals = all_vals[mask]
crypdick