How do I prevent str.contains() from searching for a sub-string?

Question

I want Pandas to search my data frame for the complete string and not a sub-string. Here is a minimal-working example to explain my problem -

data = [['tom', 'wells fargo', 'retired'], ['nick', 'bank of america', 'partner'], ['juli', 'chase', 'director - oil well']]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
 
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(val, case=False)).any(axis="columns")]

The correct code would have only returned the second row and not the first one

    Name    Place   Position
0   tom wells fargo retired
2   juli    chase   director - oil well

Update - My intention is to have a search that looks for the exact string requested. While looking for "well" the algorithm shouldn't extract out "well. Based on the comments, I understand how my question might be misleading.

"well" is a substring in "director - oil well" too isn't it? — ali bakhtiari, Jan 10 '23 at 20:20
Indeed. I suppose I am using the wrong term here. I will update the question. — desert_ranger, Jan 10 '23 at 20:23

Corralien · Accepted Answer · 2023-01-10T20:29:27.107

1

IIUC, you can use:

>>> df[~df['Position'].str.contains(fr'\b{val}\b')]

   Name        Place             Position
0   tom  wells fargo              retired
2  juli        chase  director - oil well

And for all columns:

>>> df[~df.apply(lambda x: x.str.contains(fr'\b{val}\b', case=False)).any(axis=1)]

   Name        Place             Position
0   tom  wells fargo              retired
2  juli        chase  director - oil well

edited Jan 10 '23 at 20:29

answered Jan 10 '23 at 20:22

Corralien

109,409
8
28
52

@desert_ranger. I updated my answer. Can you check the second solution please? The key is to use '\b' to match a whole word. – Corralien Jan 10 '23 at 20:29

Kanakorn Horsiritham · Answer 2 · 2023-01-10T21:01:47.663

The regular expression anchor \b which is a word boundary is what you want.

I added addtional data to your code to illustrate more:

import pandas as pd
data = [
          ['tom', 'wells fargo', 'retired']
        , ['nick', 'bank of america', 'partner']
        , ['john','bank of welly','blah']
        , ['jan','bank of somewell knwon','well that\'s it']
        , ['juli', 'chase', 'director - oil well']
        ]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
 
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(fr"\b{val}\b", case=False)).any(axis="columns")]

EDIT In Python3 the string can be substitute with the variable with f in front of " or ' and r will express it as regular expression. Then now you can get the val as you want. Thank @smci

and the output is like this

	Name	Place	Position
3	jan	bank of somewell knwon	well that's it
4	juli	chase	director - oil well

You don't *need* to escape it with `\\b`. You can simply **escape it with r'\b`**, that's called a raw string regex. See [Regular Expression HOWTO: backslash escape](https://docs.python.org/3/howto/regex.html?highlight=raw%20string#the-backslash-plague) or https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-prefixes-do-and-what-are-raw-string-literals — smci, Jan 10 '23 at 20:56

How do I prevent str.contains() from searching for a sub-string?

2 Answers2