Python Pandas going through an entire column and checking if it contains a certain str

Question

I'm kinda new to python dataframes, so this might sound really easy. I have a column called 'body_text' in a dataframe and I want to see if each row of body_text contains the word "Hello". And if it does, I want to make another column that has 1 or 0 as the values.

I tried using str.contains("Hello") but that made an error where it only selected the rows that had "Hello" and attempted to put it in another column. I tried looking at other solutions that just ended up in more errors - for loops, and str in str.

textdf = traindf[['request_title','request_text_edit_aware']]

traindf is a huge dataframe that I'm only pulling 2 columns from

Hi and welcome to the community. Remember to format code snippets (https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) and take a look at https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples. To answer the question properly, we need an idea of what the dataframe looks like. From the information you have provided, we don't know whether you want to search multiple columns for 'hello', whether we need to search for strings within a string or just for 'hello', etc. — Ludo, Jun 08 '19 at 06:41

Chris Adams · Accepted Answer · 2019-06-09T08:51:12.953

If your match is case-sensitive, use Series.str.contains and chain on .astype to cast as int:

df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)

If it should match case-insensitively, add the case=False argument:

df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
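
One thing to watch for: if body_text has missing values, str.contains returns a missing value for those rows and the cast to int can fail. Below is a minimal sketch of both variants with na=False added. The sample data and the column name contains_hello_any_case are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the None row stands in for a missing value.
df = pd.DataFrame({'body_text': ['Hello world', 'hello again', 'goodbye', None]})

# na=False treats missing values as "no match", so the result is purely boolean
# and can be cast to int safely.
df['contains_hello'] = df['body_text'].str.contains('Hello', na=False).astype(int)

# Same check, ignoring case.
df['contains_hello_any_case'] = (
    df['body_text'].str.contains('Hello', case=False, na=False).astype(int)
)

print(df)
```

Whether a missing row should count as 0 is a judgment call; here it is assumed that "no text" means "no match".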

Update

If you need to match multiple patterns, use a regex with the | ('OR') character. You may also need a 'word boundary' (\b) as well, depending on your requirements.

Regexr is a good resource if you want to learn more about regex patterns and character classes.

Example

import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})

#                      body_text
#    0           no matches here   
#    1  Hello, this should match   <--  we want to match this 'Hello'
#    2     high low - dont match   <-- 'hi' exists in 'high', but we don't want to match it
#    3    oh hi there - match me   <--  we want to match 'hi' here

df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)

                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1

Sometimes it's useful to keep the words you want to match in a list, so you can build the regex pattern with a Python list comprehension. For example:

match = ['hello', 'hi']    
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b'  -  meaning 'hello' OR 'hi'

df.body_text.str.contains(pat)
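
Putting the word-list approach together end to end, here is a rough sketch. It reuses the example frame from above and adds two things that are my own assumptions rather than part of the original snippet: re.escape, in case a word ever contains regex metacharacters, and case=False, because the words in the list are lowercase while the data has 'Hello'. The column name contains_greeting is just illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here',
                                 'Hello, this should match',
                                 'high low - dont match',
                                 'oh hi there - match me']})

match = ['hello', 'hi']

# re.escape guards against words that contain characters such as '.' or '+'.
# \b on each side means the word must stand alone ('hi' will not match 'high').
pat = '|'.join(rf'\b{re.escape(word)}\b' for word in match)

# case=False so that 'Hello' in the data matches 'hello' in the list.
df['contains_greeting'] = df['body_text'].str.contains(pat, case=False).astype(int)

print(df)
```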

In the case of multiple characters, how would you implement it? Like I wanted it to also include "Hi" as a string to check for. Thanks for answering! — Mei Tei, Jun 09 '19 at 03:56

score 0 · Answer 2 · answered Jun 08 '19 at 06:59

With textdf as you've defined in your question, try:

textdf['new_column'] = [1 if 'Hello' in t else 0 for t in textdf['body_text']]
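
In a comprehension like this, == and in behave quite differently: == only flags rows whose whole text is exactly 'Hello', while in flags any row that contains it. A small sketch of the difference, using a made-up frame and made-up column names, with an isinstance guard so missing values don't raise:

```python
import pandas as pd

# Hypothetical data; textdf here only mimics the shape described in the question.
textdf = pd.DataFrame({'body_text': ['Hello', 'Hello there', 'goodbye', None]})

# Equality: 1 only when the entire cell is exactly 'Hello'.
textdf['is_exactly_hello'] = [1 if t == 'Hello' else 0 for t in textdf['body_text']]

# Substring test: 1 when 'Hello' appears anywhere in the cell.
# The isinstance check skips missing (non-string) values.
textdf['contains_hello'] = [
    1 if isinstance(t, str) and 'Hello' in t else 0
    for t in textdf['body_text']
]

print(textdf)
```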

answered Jun 08 '19 at 06:59

hd1

score 0 · Answer 3 · edited Jun 08 '19 at 09:30

You can use the get_dummies() function in pandas.

Here is the link to documentation.

edited Jun 08 '19 at 09:30

double-beep

answered Jun 08 '19 at 09:28

MountainKing
