8

df (a Pandas DataFrame) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I'd like to extract the row that contains "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

Then it catches "His" as well, which I don't want.

Is the query below the right way to catch whole words in a df Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)
aerin

5 Answers

10

No, the regex /bis/b|/bsmall/b will fail because you are using /b, not \b, which means "word boundary".

Change that and you get a match. I would recommend using

\b(is|small)\b

This regex is a little faster and a little more legible, at least to me. Remember to put it in a raw string (r"\b(is|small)\b") so you don’t have to escape the backslashes.
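
For example (a minimal sketch against the sample frame; the two-check `&` combination at the end is just one way to require both whole words in the same row and is not part of the answer above):

import pandas as pd

df = pd.DataFrame({'col_name': ["This is Donald.",
                                "His hands are so small",
                                "Why are his fingers so short?"]})

# Whole-word match for either word; the raw string keeps \b intact.
# (pandas may warn about the capture group; (?:is|small) avoids that.)
df.col_name.str.contains(r"\b(is|small)\b", case=False)
# 0     True
# 1     True
# 2    False

# One way to require *both* whole words in the same row is to combine two checks:
both = (df.col_name.str.contains(r"\bis\b", case=False) &
        df.col_name.str.contains(r"\bsmall\b", case=False))
df[both]   # empty here: no sample row contains both "is" and "small" as whole words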

Laurel
  • Thank you. I reflected your point /b -> \b. Still want to wait a few more days to see if there is any other way to catch the whole word. – aerin Sep 07 '16 at 20:28
  • 1
    tangentially, I had to add an `r` before the string to get it to work: does anybody know why? I have not found any reference to it.. – mccc Nov 01 '17 at 16:50
  • well, apparently the `|` char makes it implicitly into a regex, while a `\b` does not.. – mccc Nov 01 '17 at 16:51
  • @mccc It makes it into a [raw string](https://docs.python.org/2.0/ref/strings.html) (this is a Python thing, not a Pandas or Regex thing). – Laurel Nov 01 '17 at 16:57
  • @Laurel I think your answer would be more complete if you added the point about using raw string parameter because that is also missing from OP's query. – Mitali Cyrus Oct 21 '20 at 04:13
4

First, you may want to convert everything to lowercase, remove punctuation and whitespace, and then convert the result into a set of words.

import string

df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    # regex=True is needed on newer pandas, where str.replace defaults to literal replacement
    .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
    .str.strip()
    .str.split()
]

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

To extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To make this all into a single statement using boolean indexing:

df.loc[[all(target_word in word_set for target_word in target_words) 
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                         .str.strip()
                         .str.split())], :]
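
(With the sample frame and target_words = ['so', 'small'] from above, this one-liner returns just row 1, matching the earlier output.)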
Alexander
1

In "\bis\b|\bsmall\b", the backslash \b is parsed as ASCII Backspace before it is even passed to the regular expression method for matching/searching. For more information check this document about escape characters. It is mentioned in this document, that

When an ‘r’ or ‘R’ prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

Therefore, there are two options:

  1. Use the r prefix:
df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)
  2. Escape the \ character:
df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)

If you want to see an example, here is the Fiddle
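
As a quick illustration of the parsing difference (a small sketch you can run in a plain Python shell):

# Without the r prefix, Python converts \b to a backspace character (0x08)
# before the regex engine ever sees the string:
print(repr("\bis\b"))    # '\x08is\x08'

# With the r prefix the backslashes survive, so re receives the word boundaries:
print(repr(r"\bis\b"))   # '\\bis\\b'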

Mitali Cyrus
0

Your way (with /b) didn't work for me. I'm not sure why you can't use the logical and operator (&), since I think that's what you actually want.

This is a silly way to do it, but it works:

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
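
If you go this route, the same mask can drive boolean indexing (a small sketch assuming series_name is df.col_name; note that plain `in` is a substring test, so it still treats the "is" inside "His" as a hit):

mask = lambda x: ("is" in x) & ("small" in x)
df[df.col_name.apply(mask)]   # keeps row 1, because "is" is found inside "His"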
szeitlin
  • The example you gave is confusing in that regard, although I see that you've since reworded it to make it a little clearer. This solves what you originally said the problem was "I'd like to extract the row that contains "is" and "small"." – szeitlin Sep 07 '16 at 17:28
0

Extending the discussion, I want to use a variable inside the regex, as follows:

df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]

If I do not put the \b ... \b, the code returns all the rows with both Sudan and South Sudan, while when I use "\b(location.loc[i]['country'])\b" it returns an empty dataframe. Kindly tell me the correct usage.
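
For what it's worth, one possible fix (a sketch, with the country value assumed to be a plain string): the variable has to be formatted into the pattern, otherwise the literal text location.loc[i]['country'] is searched for, and the pattern should be a raw string per the answers above. Using str.contains instead of str.match lets the word appear anywhere in the field:

import re

country = location.loc[i]['country']             # e.g. "Sudan" (assumed)
pattern = r"\b{}\b".format(re.escape(country))   # re.escape guards against regex metacharacters
df = df_w[df_w['Country/Region'].str.contains(pattern, case=False)]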

guptasonal