Checking for an integer in a pandas Series

Question

I ran into a surprising result when looking for an integer id number in a pandas column of integers where I knew the number was in the list. I've now boiled this down to a really simple test case that baffles me. I'm clearly missing something really obvious?!

Here is how I reproduced the problem:

import numpy as np
import pandas as pd

# Create two DataFrames; col_2 holds np.int64 values
source_series_1 = pd.DataFrame({'col_1': ['a','b','c','d'], 'col_2':np.int64([1, 2, 3, 4])})
source_series_2 = pd.DataFrame({'col_1': ['a','b','c','d'], 'col_2':np.int64([101, 102, 103, 104])})

Now test membership in these DataFrames' columns:

# Test membership in pandas series
print(np.int64(2) in source_series_1.col_2)
print(np.int64(102) in source_series_2.col_2)

output:

True
False # ?!

# But! convert to a simple list...
print(np.int64(2) in list(source_series_1.col_2))  
print(np.int64(102) in list(source_series_2.col_2))

output:

True
True

I note I get the same output for both without the explicit cast:

print(2 in source_series_1.col_2) #True
print(102 in source_series_2.col_2) #False

There is clearly something incredibly simple going on that I am just missing/forgetting. I'd love to understand why source_series_2 fails the 'in' test?

When you use `in`, you are checking whether a value is in the `index` of the Series. `102 in source_series_2.col_2.values` should return `True` — Derek O, Jun 18 '21 at 17:28
Thanks @DerekO. Yup. I actually saw that question before I posted -- but I didn't connect the dots. Sorry! — Lawrence LaPointe, Jun 18 '21 at 17:53
No worries! My vote to close the question is purely because the linked answer explains it best in my opinion so there's no need to repeat the answer here — Derek O, Jun 18 '21 at 17:56

Amit Gupta · Accepted Answer · 2021-06-18T17:34:10.527


I think the issue here is how the `in` operator works on a pandas Series.

print(np.int64(2) in source_series_1.col_2)
print(np.int64(102) in source_series_2.col_2)

This checks the index of the Series (source_series_1.col_2 or source_series_2.col_2 in your case), not its values.

Whereas

print(np.int64(2) in list(source_series_1.col_2))  
print(np.int64(102) in list(source_series_2.col_2))

is an explicit search through the values, since a plain list has no index.

I reached this conclusion by noticing that if you run

print(np.int64(2) in source_series_1.col_2)
print(np.int64(2) in source_series_2.col_2)

You'll get

True
True
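
The index-versus-values behaviour described above can be checked with a small sketch. The Series `s` and the result names here are illustrative stand-ins for `source_series_2.col_2`, assuming the default integer index 0..3:

```python
import pandas as pd

# Values 101..104 with the default index 0, 1, 2, 3
s = pd.Series([101, 102, 103, 104])

index_hit = 2 in s           # 2 is an index label, so this is True
value_miss = 102 in s        # 102 is a value, not an index label
value_hit = 102 in s.values  # searches the underlying NumPy array
list_hit = 102 in list(s)    # searches a plain Python list of the values

print(index_hit, value_miss, value_hit, list_hit)
# True False True True
```

This also shows why the first DataFrame in the question appeared to work: the value 2 happens to coincide with an index label.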

I hope this helps!

edited Jun 18 '21 at 17:34

answered Jun 18 '21 at 17:28


Ideally this should have raised the out of index error in the 1st case but I think it is not happening. – Amit Gupta Jun 18 '21 at 17:29

Ahah. I never realized it was actually looking at the index. This makes great sense. Can also fix it simply by: `print(np.int64(102) in source_series_2.col_2.values)`. This really helps. – Lawrence LaPointe Jun 18 '21 at 17:42
Thanks @Amit Gupta. I knew it was something simple! "in" looks at the index not at the values. Onward. Also worth mentioning that `someseries.isin(target)` would be another way to solve the problem but wanted to understand why the simple case wasn't working. – Lawrence LaPointe Jun 18 '21 at 17:47
Yes, I also learned a good thing from this. it was really obvious but never observed. – Amit Gupta Jun 18 '21 at 17:48

score 0 · Answer 2 · answered Jun 18 '21 at 17:28

pd.Series is not meant to work this way with the 'in' operator. For example, this works, but it doesn't make sense, right?

>>> np.int64(1) in source_series_2.col_2
True
>>> np.int64(2) in source_series_2.col_2
True

To match against the elements themselves, compare the Series with the target value. The call below is equivalent to `source_series_2.col_2 == 101` and returns a boolean Series with one entry per element:

getattr(source_series_2.col_2,'__eq__')(101)

Output:

0     True
1    False
2    False
3    False
Name: col_2, dtype: bool
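
The elementwise result above can be reduced to a single True/False with `.any()`. The following is a minimal sketch, not the only way to do it. The Series `s` is an illustrative stand-in for `source_series_2.col_2`, and `isin` is the alternative mentioned in the comments on the accepted answer:

```python
import pandas as pd

# Stand-in for source_series_2.col_2
s = pd.Series([101, 102, 103, 104], name="col_2")

# Elementwise comparison gives a boolean Series, one entry per value
mask = s == 101

# Reduce to a single answer: is the target among the values?
found_eq = bool(mask.any())
found_isin = bool(s.isin([102]).any())
missing = bool((s == 2).any())  # 2 is only an index label, not a value

print(found_eq, found_isin, missing)
# True True False
```

Unlike `in`, both forms look only at the values, so an index label such as 2 is not reported as present.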
