1

I'm fairly new to pandas DataFrames, so this might be a really easy question. I have a column called 'body_text' in a DataFrame, and I want to check whether each row of body_text contains the word "Hello". If it does, I want another column that holds 1 for that row, and 0 otherwise.

I tried using str.contains("Hello"), but that went wrong: it only selected the rows that had "Hello" and tried to put those in another column. The other solutions I looked at (for loops, and str in str) just ended up in more errors.

textdf = traindf[['request_title','request_text_edit_aware']]
traindf is a huge dataframe that I'm only pulling 2 columns from
anky
Mei Tei
  • Please add your attempt and the error in an [edit] – roganjosh Jun 08 '19 at 06:38
  • 1
    Hi and welcome to the community. Remember to format code snippets (https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) and take a look at this https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples To answer the question properly we need to have an idea of what the dataframe looks like. From thee information you have provided us with, we don't know if you want to search multiple columns for 'hello', if we need to search for strings within a string or just for hello etc... – Ludo Jun 08 '19 at 06:41

3 Answers

1

If your match is case-sensitive, use Series.str.contains and chain on .astype to cast as int:

df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)

If the match should be case-insensitive, add the case=False argument:

df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
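One thing to watch for: if body_text has missing values, str.contains returns NaN for those rows and the cast to int can fail. A minimal sketch of one way around that, assuming missing text should count as "no match" (the sample frame and the second column name below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing body_text value.
df = pd.DataFrame({'body_text': ['Hello world', 'goodbye', None, 'hello again']})

# na=False treats missing values as "no match", so the result is purely boolean
# and can be cast to int safely.
df['contains_hello'] = df['body_text'].str.contains('Hello', na=False).astype(int)

# Same check, ignoring case.
df['contains_hello_any_case'] = (
    df['body_text'].str.contains('hello', case=False, na=False).astype(int)
)

print(df)
```

Whether a missing value should map to 0 depends on your data; na=True would flag those rows instead.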

Update

If you need to match multiple patterns, use a regex with the | ('OR') character. You may also need a 'word boundary' (\b) character, depending on your requirements.

Regexr is a good resource if you want to learn more about regex patterns and character classes.

Example

import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})

#                      body_text
#    0           no matches here   
#    1  Hello, this should match   <--  we want to match this 'Hello'
#    2     high low - dont match   <-- 'hi' exists in 'high', but we don't want to match it
#    3    oh hi there - match me   <--  we want to match 'hi' here

df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)

                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1

Sometimes it's useful to keep a list of the words you want to match and build the regex pattern from it with a Python list comprehension. For example:

match = ['hello', 'hi']    
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b'  -  meaning 'hello' OR 'hi'

df.body_text.str.contains(pat)
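
Putting the list-based pattern together end to end might look like the sketch below. Two things here are additions rather than part of the snippet above: re.escape, which guards against words that contain regex metacharacters, and case=False, which is assumed so that the lower-case 'hello' in the list still matches 'Hello' in the text.

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match',
                                 'high low - dont match', 'oh hi there - match me']})

match = ['hello', 'hi']

# Wrap each word in word boundaries; re.escape keeps characters such as '.' or '+'
# in a word from being read as regex syntax.
pat = '|'.join(rf'\b{re.escape(x)}\b' for x in match)

# case=False is an assumption: it lets 'hello' in the list match 'Hello' in the text.
df['contains_hello'] = df['body_text'].str.contains(pat, case=False, na=False).astype(int)

print(df)
```

Drop case=False if the words in the list should match only in the exact case given.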
Chris Adams
0

With textdf as you've defined in your question, try:

# 'Hello' in t checks for a substring; t == 'Hello' would only match rows that are exactly 'Hello'
textdf['new_column'] = [1 if 'Hello' in t else 0 for t in textdf['body_text']]
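
For a containment check in a list comprehension, Python's in operator is the relevant test (an equality test with == would only flag rows whose whole text is exactly 'Hello'). A minimal sketch, assuming a made-up sample frame and that non-string values such as missing text should count as 0:

```python
import pandas as pd

# Hypothetical sample data standing in for textdf; None represents missing text.
textdf = pd.DataFrame({'body_text': ['Hello there', 'Hello', 'nothing', None]})

# The isinstance guard skips non-string values, which would otherwise raise a TypeError.
textdf['new_column'] = [
    1 if isinstance(t, str) and 'Hello' in t else 0
    for t in textdf['body_text']
]

print(textdf)
```

On a large dataframe the vectorised str.contains approach is usually the more idiomatic choice; the comprehension is mainly useful when the per-row logic gets more involved.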
hd1
0

You can use the get_dummies() function in pandas.

Here is the link to documentation.

double-beep