94

I have a very large NumPy array

1 40 3
4 50 4
5 60 7
5 49 6
6 70 8
8 80 9
8 72 1
9 90 7
.... 

I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.

Thanks!

agf
thegreatt
  • 2
    You might use binary search if the 1st index is in non-decreasing order, or consider sorting if you do more than, let's say, 10 searches – Luka Rahne Aug 17 '11 at 06:39

8 Answers

109

How about

if value in my_array[:, col_num]:
    do_whatever

Edit: I think __contains__ is implemented in such a way that this is the same as @detly's version
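
As a minimal sketch using the sample rows from the question (the names `my_array`, `col_num`, `found_8`, and `found_7` are placeholders chosen for illustration), the check can be tried like this:

```python
import numpy as np

# Sample rows copied from the question
my_array = np.array([
    [1, 40, 3],
    [4, 50, 4],
    [5, 60, 7],
    [5, 49, 6],
    [6, 70, 8],
    [8, 80, 9],
    [8, 72, 1],
    [9, 90, 7],
])

col_num = 0  # first column

found_8 = 8 in my_array[:, col_num]  # 8 appears in the first column
found_7 = 7 in my_array[:, col_num]  # 7 does not

print(found_8, found_7)  # True False
```

Slicing with `my_array[:, col_num]` returns a view, so no copy of the column is made before the membership test.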

agf
  • 9
    You know, I've been using `numpy`'s `any()` function so heavily recently, I completely forgot about plain old `in`. – detly Aug 17 '11 at 06:22
  • 15
    Okay, this is (a) more readable and (b) about 40% faster than my answer. – detly Aug 17 '11 at 06:42
  • 7
    In principle, `value in …` can be faster than `any(… == value)`, because it can iterate over the array elements and stop whenever the value is encountered (as opposed to calculating whether each array element is equal to the value, and then checking whether one of the boolean results is true). – Eric O. Lebigot Aug 17 '11 at 08:02
  • 1
    @EOL really? In Python, `any` is short-circuiting, is it not in `numpy`? – agf Aug 17 '11 at 08:08
  • @EricLeschinski Your edit is confusing and doesn't directly relate to the question, so I'm reverting it. Perhaps you can post it as a comment, and rewrite it to be more clear. – agf Oct 13 '17 at 16:22
  • 10
    Things have changed since then. Note that in the future @detly's answer will become the only working solution; currently a warning is thrown. See https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur for more. – borgr Jan 08 '18 at 14:28
  • 1
    @agf: Base Python `any` short circuits, but `==` with an array operand doesn't - it has to make a whole array of comparison results. `numpy.any` is *supposed* to short-circuit, but there's a long-standing performance regression related to how ufuncs chunk their operands, that prevents the short-circuiting. `in` with a NumPy array could have been implemented in a way that short-circuits, but instead, `thing in array` is basically just implemented as `(array == thing).any()`. That doesn't short-circuit, and even worse, it produces complete nonsense when `thing` is also an array. – user2357112 Aug 07 '23 at 11:09
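
To illustrate the last comment, here is a small sketch (the array contents and the names `arr`, `row`, `misleading`, and `row_present` are made up for the example) of how `in` with an array operand can report a match for a row that does not exist, along with one possible way to test for a whole row instead:

```python
import numpy as np

arr = np.array([[1, 5],
                [7, 2]])
row = np.array([1, 2])  # not a row of arr

# `row in arr` behaves like (arr == row).any(): the comparison broadcasts
# element-wise, so a single matching element is enough to make it True.
misleading = row in arr

# Require every element of some row to match instead.
row_present = bool((arr == row).all(axis=1).any())

print(misleading, row_present)  # True False
```

For a scalar `value` and a 1-D column this pitfall does not arise, which is the case the question asks about.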
61

The most obvious to me would be:

np.any(my_array[:, 0] == value)
eduardosufan
detly
  • 2
    Hi @detly, can you add more explanation? It seems very obvious to you, but not to a beginner like me. My instinct tells me that this might be the solution I'm looking for, but I could not try it without examples :D – jameshwart lopez Apr 11 '18 at 06:46
  • 3
    @jameshwartlopez `my_array[:, 0]` gives you all the rows (indicated by `:`) and for each row the `0`th element, i.e. the first column. This is a simple one-dimensional array, for example `[1, 3, 6, 2, 9]`. If you use the `==` operator in numpy with a scalar, it will do element-wise comparison and return a boolean numpy array of the same shape as the array. So `[1, 3, 6, 2, 9] == 3` gives `[False, True, False, False, False]`. Finally, `np.any` checks, if any of the values in this array are `True`. – Kilian Obermeier May 16 '18 at 14:02
49

To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the python keyword in. If your data is sorted, you can use numpy.searchsorted():

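import numpy as np

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])
values = [2, 3, 4, 6, 7]

# Element-wise membership test: is each of `values` present in `data`?
print(np.in1d(values, data))

# For sorted data: find the insertion points, then compare
index = np.searchsorted(data, values)
print(data[index] == values)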
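
The `searchsorted` lookup above can fail when a value is larger than everything in `data`, because `searchsorted` then returns `len(data)`, which is not a valid index. A minimal sketch of one way to guard against that, assuming `data` is sorted and non-empty (the helper name `in_sorted` is made up for illustration), with `np.isin` shown alongside as a cross-check:

```python
import numpy as np

def in_sorted(data, values):
    """Return a boolean mask: which of `values` occur in sorted `data`."""
    data = np.asarray(data)
    values = np.asarray(values)
    idx = np.searchsorted(data, values)
    # searchsorted returns len(data) for values above the maximum;
    # clamp so the index stays valid (the comparison is then False).
    idx = np.minimum(idx, len(data) - 1)
    return data[idx] == values

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])
values = [2, 3, 4, 6, 7, 10]  # 10 is larger than data.max()

mask = in_sorted(data, values)
print(mask)                   # [False False  True  True False False]
print(np.isin(values, data))  # same mask, without requiring sorted data
```

Clamping is only one option; taking the index modulo `len(data)` would also keep it in range.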
HYRY
  • 4
    +1 for the less well-known `numpy.in1d()` and for the very fast `searchsorted()`. – Eric O. Lebigot Aug 17 '11 at 08:06
  • @eryksun: Yeah, interesting. Same observation, here… – Eric O. Lebigot Aug 17 '11 at 13:12
  • 1
    Note that the final line will throw an `IndexError` if any element of `values` is larger than the greatest value of `data`, so that requires specific attention. – fuglede Jul 30 '19 at 09:11
  • @fuglede It's possible to replace `index` with `index % len(data)` or `np.append(index[:-1],0)` equivalently in this case. – mathfux Jan 03 '20 at 16:19
  • 1
    [`np.in1d()`](https://numpy.org/doc/stable/reference/generated/numpy.in1d.html) is limited to 1-d numpy arrays. If you want to check if multiple values are in a multidimensional numpy array, use the [`np.isin()`](https://numpy.org/doc/stable/reference/generated/numpy.isin.html) method. – Aelius Jan 06 '21 at 21:42
28

Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riffs.

Here are my speed tests for Python 2.7.10:

import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

18.86137104034424

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')

15.061666011810303

timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

11.613027095794678

timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

7.670552015304565

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

5.610057830810547

timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

1.6632978916168213

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')

0.0548710823059082

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')

0.054754018783569336

Very surprising! Orders of magnitude difference!

To summarize, if you just want to know whether something's in a 1D list or not:

  • 19s N.any(N.in1d(numpy array))
  • 15s x in (list)
  • 8s N.any(x == numpy array)
  • 6s x in (numpy array)
  • .1s x in (set or a dictionary)

If you want to know where something is in the list as well (order is important):

  • 12s N.in1d(x, numpy array)
  • 2s x == (numpy array)
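
The figures above come from Python 2.7. A rough sketch of how the comparison might be rerun on a current Python 3 / NumPy setup is below. It also times the array-to-set conversion, which matters if you start from an array rather than an existing set. The repeat count and the labels in `candidates` are arbitrary choices, and no timings are quoted because they vary by machine.

```python
import timeit
import numpy as np

sids = np.array([20010401010101 + x for x in range(1000)])
val = 20010401020091  # same probe value as above; not present in sids
sid_set = set(sids.tolist())

# Each entry is a zero-argument callable returning the membership result.
candidates = {
    "val in array": lambda: val in sids,
    "np.any(array == val)": lambda: np.any(sids == val),
    "np.isin(array, val).any()": lambda: np.isin(sids, val).any(),
    "val in set (prebuilt)": lambda: val in sid_set,
    "set(array) + lookup": lambda: val in set(sids.tolist()),
}

# Sanity check: every method should agree on the answer.
results = {name: bool(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=1000)
    print(f"{name:28s} {seconds:.4f} s per 1000 calls")
```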
Lukas Mandrake
3

Adding to @HYRY's answer, in1d seems to be the fastest for NumPy. This is using NumPy 1.8 and Python 2.7.6.

In this test in1d was fastest; however, `10 in a` looks cleaner:

a = arange(0,99999,3)
%timeit 10 in a
%timeit in1d(a, 10)

10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop

Constructing a set is slower than calling in1d, but checking if the value exists is a bit faster:

s = set(range(0, 99999, 3))
%timeit 10 in s

10000000 loops, best of 3: 47 ns per loop
Joelmob
  • 2
    The comparison isn't fair. You need to count the cost of converting an array to a `set`. OP starts with a NumPy array. – jpp Aug 08 '18 at 08:39
  • I didn't mean to compare the methods like that so i edited the post to point out the cost of creating a set. If you already have python set, there is no big difference. – Joelmob Feb 26 '21 at 10:59
0

In my opinion, the most convenient way is:

(Val in X[:, col_num])

where Val is the value that you want to check for and X is the array. In your example, suppose you want to check if the value 8 exists in the third column. Simply write

(8 in X[:, 2])

This will return True if 8 is there in the third column, else False.

Loochie
0

If you are looking for a list of integers, you may use indexing for doing the work. This also works with nd-arrays, but seems to be slower. It may be better when doing this more than once.

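import numpy as np

def valuesInArray(values, array):
    values = np.asanyarray(values)
    array = np.asanyarray(array)
    # np.int was removed in NumPy 1.24; accept any integer dtype instead
    assert np.issubdtype(array.dtype, np.integer) and np.issubdtype(values.dtype, np.integer)

    # Boolean lookup table indexed by value (assumes non-negative integers);
    # sized to cover the largest value in either input
    matches = np.zeros(max(array.max(), values.max()) + 1, dtype=np.bool_)
    matches[values] = True

    # True wherever an element of `array` is one of `values`
    res = matches[array]

    return np.any(res), res


array = np.random.randint(0, 1000, (10000, 3))
values = np.array((1, 6, 23, 543, 222))

matched, matches = valuesInArray(values, array)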

By using numba and njit, I could get about a 10x speedup.

0

I recommend using np.isin.
The guide suggests this function for masking values, but you can simply call any or all on these masks yourself to check membership. I recommend checking speeds with timeit as suggested above.
Do not use for loops; this defeats the idea of using numpy in the first place.
You can either check if any member of the list is in an array by putting the array first, or put the array second to check if it covers all members of the list.

import numpy as np
a = np.arange(9).reshape((3,3))
any_lookup = [2,6,10,10002,34543,45]
all_lookup = [2,3,4,5]
none_lookup = [-10,435344,-255,557755]

res_any = np.isin(a,any_lookup)
res_all = np.isin(a,all_lookup)
res_none = np.isin(a,none_lookup)

print(res_any)
print(res_all)
print(res_none)

print(res_any.any())
print(res_all.any())
print(res_none.any())

print(np.isin(any_lookup,a).all())
print(np.isin(all_lookup,a).all())

Results:

[[False False  True]
 [False False False]
 [ True False False]]
[[False False  True]
 [ True  True  True]
 [False False False]]
[[False False False]
 [False False False]
 [False False False]]
True
True
False
False
True
juxyper