9

Why can't I match a string in a Pandas series using in? In the following example, the first evaluation results in False unexpectedly, but the second one works.

df = pd.DataFrame({'name': [ 'Adam', 'Ben', 'Chris' ]})
'Adam' in df['name']
'Adam' in list(df['name'])
ceiling cat
  • 1
    `df['name'].eq('Adam').any()` – BENY Mar 20 '18 at 19:57
  • This is where a [mcve] comes in handy. As it is, your question fails to specify if you want `'Adam' 'Adams'` to be `True`. Do you want a series of `True`/`False` signifying the truth for each element? – piRSquared Mar 20 '18 at 20:18

2 Answers

14

In the first case:

The in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find that it is the following (inherited from pandas.core.generic.NDFrame):

def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis

So your first use of in is interpreted as:

'Adam' in df['name']._info_axis 

This gives False, as expected, because df['name']._info_axis holds the index (here a RangeIndex) and not the data itself:

In [37]: df['name']._info_axis 
Out[37]: RangeIndex(start=0, stop=3, step=1)

In [38]: list(df['name']._info_axis) 
Out[38]: [0, 1, 2]
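
As a minimal sketch of the behaviour described above (the variable names s and by_name are illustrative and not from the question), the following shows that in tests index labels, and that checking against the values, as suggested in the comments, finds the string:

```python
import pandas as pd

s = pd.Series(['Adam', 'Ben', 'Chris'], name='name')

# `in` on a Series tests the index labels, not the stored values.
label_hit = 0 in s             # 0 is a label of the default RangeIndex
value_as_label = 'Adam' in s   # 'Adam' is a value, not a label

# Testing against the underlying values behaves like a value lookup.
value_hit = 'Adam' in s.values

# If the names are the index, `in` finds 'Adam' because it is now a label.
by_name = pd.Series([1, 2, 3], index=['Adam', 'Ben', 'Chris'])
relabelled_hit = 'Adam' in by_name

print(label_hit, value_as_label, value_hit, relabelled_hit)
```

The last case mirrors how in works on a dict, where membership is tested against keys and not values.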

In the second case:

'Adam' in list(df['name'])

Calling list converts the pandas.Series to a list of its values, so the actual operation is this:

In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']

In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True

Here are a few more idiomatic ways to do what you want (with their timings):

In [56]: df.name.str.contains('Adam').any()
Out[56]: True

In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop

In [58]: df.name.isin(['Adam']).any()
Out[58]: True

In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop

In [60]: df.name.eq('Adam').any()
Out[60]: True

In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop

Note: the last way is also suggested by @Wen in the comment above
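
One caveat, sketched here under the assumption that an exact match is wanted: the three approaches are not interchangeable. str.contains does a substring (and by default regex) search, while eq and isin compare whole values. The sample data below is made up to show the difference:

```python
import pandas as pd

# Hypothetical data: note 'Adams', not 'Adam'.
names = pd.Series(['Adams', 'Ben', 'Chris'])

# Substring search: 'Adam' occurs inside 'Adams'.
substring_hit = bool(names.str.contains('Adam').any())

# Whole-value comparisons: no element equals 'Adam'.
exact_hit = bool(names.eq('Adam').any())
isin_hit = bool(names.isin(['Adam']).any())

# The pattern is a regex by default, so '.' matches any character;
# regex=False makes str.contains treat the pattern literally.
regex_hit = bool(names.str.contains('A.am').any())
literal_hit = bool(names.str.contains('A.am', regex=False).any())

print(substring_hit, exact_hit, isin_hit, regex_hit, literal_hit)
```

If the search text may contain regex metacharacters and a substring match is intended, passing regex=False is probably the safer choice.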

Mohamed Ali JAMAOUI
  • Thanks for the detailed response. Is there a more idiomatic way to do #2? – ceiling cat Mar 20 '18 at 20:30
  • 3
    Given that `list(series)` produces a list of its values, and `iter(series)` produces an iterator over its values, how could anyone possibly expect that `series.__contains__` checks the *index* instead of the values? Seems like a major flaw in the Series API. (But thank you for this explanation, I was pulling my hair out for a while before finding this answer). – BallpointBen Mar 29 '20 at 03:00
  • 3
    This answer is completely wrong. (At least when using Pandas version 0.25.3.) It's understandable and efficient to simply use `'Adam' in df['name'].values`. – Mr. Lance E Sloan Aug 20 '20 at 19:42
  • @Mr.LanceESloan It can be supplemented by the fact that, if you are not afraid of confusing `DataFrame` values, it is indeed quicker to check `DataFrame` values directly than column ones, because attribute access takes a comparably long time – Vovin Sep 26 '22 at 09:25
-1
found = df[df['Column'].str.contains('Text_to_search')]
print(len(found))

len(found) will give you the number of rows in the column that match.
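
A small runnable variant of the snippet above, using made-up sample rows (the DataFrame contents are an assumption for illustration). It also shows that summing the boolean mask gives the same count without building the filtered frame:

```python
import pandas as pd

# Hypothetical sample data; 'Column' and 'Text_to_search' follow the snippet above.
df = pd.DataFrame({'Column': ['Text_to_search here', 'nothing', 'more Text_to_search']})

# Boolean mask of rows whose text contains the search string (literal match).
mask = df['Column'].str.contains('Text_to_search', regex=False)

found = df[mask]         # the matching rows
count = int(mask.sum())  # True counts as 1, so the sum is the number of matches

print(len(found), count)
```

Note that this is a substring match, so it answers "how many rows contain the text" and not "is this exact value present".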

Shahir Ansari