I have a pandas dataframe that has reviews in it and I want to search for a specific word in all of the columns.

df["Summary"].str.lower().str.contains("great", na=False)

This gives the outcome as true or false, but I want to create a new column with 1 or 0 written in the corresponding rows.

For example, if the review has 'great' in it, it should give 1, otherwise 0. I tried this:

if df["Summary"].str.lower().str.contains("great", na=False) == True:
    df["Great"] = '1'
else:
    df["Great"] = '0'

It gives this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). How can I solve this?

cs95
fawkemvegas
  • Try [`np.where`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html). `df["Great"] = np.where(df["Summary"].str.lower().str.contains("great", na=False), '1', '0')` – 0x5453 May 17 '19 at 18:49

3 Answers

Since True/False corresponds to 1/0, all you need is an astype conversion from bool to int:

df['Great'] = df["Summary"].str.contains("great", case=False, na=False).astype(int)

Also note I've removed the str.lower call and added case=False as an argument to str.contains for a case insensitive comparison.
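
To see the conversion end to end, here is a minimal sketch on made-up data (the sample reviews are hypothetical; only the column names come from the question):

```python
import pandas as pd

# Hypothetical sample data; "Summary" and "Great" are the names used in the question.
df = pd.DataFrame({"Summary": ["Great product", "Not good", None, "works GREAT"]})

# case=False makes the match case-insensitive; na=False turns missing values into False.
mask = df["Summary"].str.contains("great", case=False, na=False)

# Booleans convert directly to integers: True -> 1, False -> 0.
df["Great"] = mask.astype(int)

print(df["Great"].tolist())  # [1, 0, 0, 1]
```

The missing review ends up as 0 because of na=False; without it, the missing value would propagate and the integer conversion would likely fail.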


Another solution would be to lowercase and then disable the regex matching for better performance.

df['Great'] = (df["Summary"].str.lower()
                            .str.contains("great", regex=False, na=False)
                            .astype(int))
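
Besides speed, regex=False also means the pattern is taken literally, which matters if the search text contains regex metacharacters. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Summary": ["Great (really)", "Greatly overrated", "meh"]})

lowered = df["Summary"].str.lower()

# With regex=False the pattern is a plain substring, so "(" needs no escaping.
# As a regex, "great (" would be an invalid pattern (unbalanced parenthesis).
literal = lowered.str.contains("great (", regex=False, na=False).astype(int)

# Plain substring matching also hits longer words such as "greatly".
substring = lowered.str.contains("great", regex=False, na=False).astype(int)

print(literal.tolist())    # [1, 0, 0]
print(substring.tolist())  # [1, 1, 0]
```

If you need whole-word matching instead, that is a job for the regex mode (for example with word boundaries), at some cost in speed.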

Finally, you can also use a list comprehension:

df['Great'] = [1 if 'great' in s.lower() else 0 for s in df['Summary']]

If you need to handle numeric data as well, use

df['Great'] = [
    1 if isinstance(s, str) and 'great' in s.lower() else 0 
    for s in df['Summary']
]
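
As a quick check of the isinstance guard, here is the same comprehension run on a hypothetical column that mixes strings, numbers, and a missing value:

```python
import pandas as pd

# Hypothetical mixed-type column: strings, an int, a float, and a missing value.
df = pd.DataFrame({"Summary": ["Great value", 42, None, "not bad", 3.5]})

# Non-string entries short-circuit to 0 before .lower() is ever called.
df["Great"] = [
    1 if isinstance(s, str) and "great" in s.lower() else 0
    for s in df["Summary"]
]

print(df["Great"].tolist())  # [1, 0, 0, 0, 0]
```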

I've detailed the advantages of list comprehensions for object data ad nauseam in this post of mine: For loops with pandas - When should I care?

cs95

Your condition df["Summary"].str.lower().str.contains("great", na=False)

will return a Series of True or False values. Comparing that Series to True gives another Series, not a single Python boolean, so the if statement cannot evaluate it. Instead, you can do this to achieve what you want:

df['Great'] = df['Summary'].apply(lambda x: 'great' in x.lower()).astype(int)
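
To make the cause of the error concrete, this sketch (on hypothetical data) reproduces it and shows the reductions the error message suggests:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Summary": ["Great product", "Not good"]})
mask = df["Summary"].str.lower().str.contains("great", na=False)

# Comparing a Series to True is element-wise, so the result is another Series.
compared = mask == True
print(type(compared).__name__)  # Series

# `if` needs one boolean; a multi-element Series cannot supply one.
try:
    if compared:
        pass
    raised = False
except ValueError:
    raised = True

print(raised)      # True
print(mask.any())  # True: at least one row matches
print(mask.all())  # False: not every row matches
```

any() and all() collapse the Series to one value, which is why the message mentions them, but neither gives a per-row result; for that, assign the Series (or its integer form) to the new column directly.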
NBWL
  • `apply` has limited use cases and should be avoided when there are better (read: vectorized/inbuilt) alternatives. You can read more [here](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code). – cs95 May 17 '19 at 19:04
  • thanks for this, I'm going to start using the .str accessor over apply now – NBWL May 17 '19 at 19:09
  • Happy coding :)) – cs95 May 17 '19 at 19:10

A possible solution using numpy

import numpy as np
df["Great"] = np.where(df["Summary"].str.lower().str.contains("great", na=False), '1', '0')

Check the documentation here.
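
A runnable sketch of the np.where approach on hypothetical data. Note that contains is reached through the .str accessor, and that the two fill values decide whether the new column holds strings or integers:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Summary": ["Great product", "Not good", None]})

# Boolean mask; na=False maps the missing review to False.
mask = df["Summary"].str.lower().str.contains("great", na=False)

# np.where(condition, value_if_true, value_if_false) works element-wise.
df["Great"] = np.where(mask, "1", "0")   # string labels, as in the question
df["GreatInt"] = np.where(mask, 1, 0)    # integer labels (hypothetical extra column)

print(df["Great"].tolist())     # ['1', '0', '0']
print(df["GreatInt"].tolist())  # [1, 0, 0]
```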

Nazim Kerimbekov
David Sidarous