
I ran into a surprising result when looking for an integer id number in a pandas column of integers where I knew the number was in the list. I've now boiled this down to a really simple test case that baffles me. I'm clearly missing something really obvious?!

Here is how I reproduced the problem:

import numpy as np
import pandas as pd

# Create two DataFrames; col_2 has dtype int64
source_series_1 = pd.DataFrame({'col_1': ['a','b','c','d'], 'col_2':np.int64([1, 2, 3, 4])})
source_series_2 = pd.DataFrame({'col_1': ['a','b','c','d'], 'col_2':np.int64([101, 102, 103, 104])})

Now test membership in these DataFrames:

# Test membership in pandas series
print(np.int64(2) in source_series_1.col_2)
print(np.int64(102) in source_series_2.col_2)

output:

True
False # ?!

# But! Convert to a simple list...
print(np.int64(2) in list(source_series_1.col_2))  
print(np.int64(102) in list(source_series_2.col_2))

output:

True
True

I note I get the same output for both without the explicit cast:

print(2 in source_series_1.col_2) #True
print(102 in source_series_2.col_2) #False

There is clearly something incredibly simple going on that I am just missing/forgetting. I'd love to understand why source_series_2 fails the 'in' test?

2 Answers


I think the issue here is with how the 'in' operator works internally on a pandas Series.

print(np.int64(2) in source_series_1.col_2)
print(np.int64(102) in source_series_2.col_2)

This checks the index of the Series (source_series_1.col_2 or source_series_2.col_2 in your case), not its values. Both columns have the default index 0 to 3, so 2 is found as an index label while 102 is not.

Whereas

print(np.int64(2) in list(source_series_1.col_2))  
print(np.int64(102) in list(source_series_2.col_2))

explicitly searches the values, because a plain list has no index.

I reached this conclusion because if you test for

print(np.int64(2) in source_series_1.col_2)
print(np.int64(2) in source_series_2.col_2)

You'll get

True
True
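The index lookup is easier to see when the index does not look like the values. Here is a minimal sketch, assuming a hypothetical Series `s` with labels 10 to 40 (not part of the question's data):

```python
import pandas as pd

# Hypothetical Series with a non-default index, so labels and values cannot be confused
s = pd.Series([101, 102, 103, 104], index=[10, 20, 30, 40])

label_hit = 10 in s          # 10 is an index label
value_miss = 102 in s        # 102 is a value, not a label
value_hit = 102 in s.values  # search the underlying NumPy array instead

print(label_hit, value_miss, value_hit)  # True False True
```

With the default index 0 to 3, the same rule explains why 2 is "found" in both columns.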

I hope this helps!

Amit Gupta
  • Ideally this should have raised the out of index error in the 1st case but I think it is not happening. – Amit Gupta Jun 18 '21 at 17:29
  • Ahah. I never realized it was actually looking at the index. This makes great sense. Can also fix it simply by: `print(np.int64(102) in source_series_2.col_2.values)`. This really helps. – Lawrence LaPointe Jun 18 '21 at 17:42
  • Thanks @Amit Gupta. I knew it was something simple! "in" looks at the index not at the values. Onward. Also worth mentioning that `someseries.isin(target)` would be another way to solve the problem but wanted to understand why the simple case wasn't working. – Lawrence LaPointe Jun 18 '21 at 17:47
  • Yes, I also learned a good thing from this. it was really obvious but never observed. – Amit Gupta Jun 18 '21 at 17:48

pd.Series is not meant to be used this way with the 'in' operator. For example, both of these return True even though neither 1 nor 2 is a value in the column, which doesn't make sense, right?

>>> np.int64(1) in source_series_2.col_2
True
>>> np.int64(2) in source_series_2.col_2
True

To test against the actual elements, compare the Series with the value. This does an element-wise match and returns one boolean per row:

source_series_2.col_2 == 101

Output:

0     True
1    False
2    False
3    False
Name: col_2, dtype: bool
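To turn that element-wise result into a single True/False answer, one option is to reduce it with `.any()`. This is a sketch, assuming a standalone Series `col` that mirrors the question's `col_2`; `isin` is the alternative the asker mentions in the comments:

```python
import numpy as np
import pandas as pd

# Stand-in for source_series_2.col_2 from the question
col = pd.Series(np.array([101, 102, 103, 104], dtype=np.int64), name="col_2")

found_eq = bool((col == 102).any())        # element-wise comparison, then reduce
found_isin = bool(col.isin([102]).any())   # same check via isin
missing = bool((col == 2).any())           # 2 is an index label, not a value

print(found_eq, found_isin, missing)  # True True False
```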
Red