If your match is case-sensitive, use Series.str.contains and chain on .astype to cast as int:
df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)
If the match should be case-insensitive, add the case=False argument:
df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
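One thing to watch for: if body_text has missing values, str.contains returns NaN for those rows and .astype(int) will raise. Here is a minimal sketch of both variants with na=False added so that missing text counts as "no match". The sample data and the variable names case_sensitive and case_insensitive are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one missing value
df = pd.DataFrame({'body_text': ['Hello world', 'hello again', np.nan, 'goodbye']})

# na=False treats missing text as "no match", so the result stays boolean
# and can be cast to int safely
case_sensitive = df['body_text'].str.contains('Hello', na=False).astype(int)
case_insensitive = df['body_text'].str.contains('Hello', case=False, na=False).astype(int)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

Whether a missing value should count as 0 depends on your data; if you would rather keep it as missing, skip na=False and the int cast.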
Update
If you need to match multiple patterns, use regex with the | ('OR') character. You may also need a 'word boundary' character (\b), depending on your requirements.
Regexr is a good resource if you want to learn more about regex patterns and character classes.
Example
import pandas as pd
df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})
# body_text
# 0 no matches here
# 1 Hello, this should match <-- we want to match this 'Hello'
# 2 high low - dont match <-- 'hi' exists in 'high', but we don't want to match it
# 3 oh hi there - match me <-- we want to match 'hi' here
df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)
                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1
Sometimes it's useful to have a list of words you want to match, so you can create a regex pattern more easily with a Python list comprehension. For example:
match = ['hello', 'hi']
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b' - meaning 'hello' OR 'hi'
df.body_text.str.contains(pat)
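Note that the list above is lowercase, so on its own the pattern will not match 'Hello' in the example data. A rough sketch of one way to handle that, and to guard against words that contain regex metacharacters, is to combine re.escape with case=False. The column name contains_greeting is just a placeholder, not something from the original example.

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match',
                                 'high low - dont match', 'oh hi there - match me']})

match = ['hello', 'hi']

# re.escape protects against words containing regex metacharacters such as '.' or '+'
pat = '|'.join(rf'\b{re.escape(x)}\b' for x in match)

# case=False so the lowercase 'hello' in the list also matches 'Hello' in the text
# ('contains_greeting' is a hypothetical column name)
df['contains_greeting'] = df['body_text'].str.contains(pat, case=False).astype(int)

print(df)
```

The \b on each side only behaves as expected when the word starts and ends with a letter, digit, or underscore, so adjust the boundaries if your list contains something like 'c++'.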