
My dataframe has a summary column with plain text. I also have a dictionary matching new column names as keys to lists of keywords as values. I'd like to add all those columns to my dataframe with each row initialized as 1 if any of their associated keywords is contained in my summary or -99 if no keywords are present.

Here's my code trying to accomplish this:

# headers is a list of strings, keywords is a list of lists.  Each column has a list of keywords
KEYWORDS_DICT = dict(zip(headers, keywords))

for column in KEYWORDS_DICT:
    df[column] = np.where(any(df['summary'].str.contains(keyword) for keyword in KEYWORDS_DICT[column]), 1, -99)
        

It's currently giving me 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().' Is there a good way to resolve this or another way to accomplish my goal?

Thanks!

zishaf
    Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – ouroboros1 Aug 17 '23 at 16:36

2 Answers


The proposed answer gave me all 1s for all columns. I got the result I wanted by calling '|'.join() on each keyword list and then searching my summary for that combined pattern, so a single search per column covers every keyword.
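A minimal sketch of that approach, assuming the search is done with str.contains and using a small made-up frame and keyword dictionary for illustration. The re.escape call is an addition of mine, not part of what I originally ran; it only matters if a keyword contains regex metacharacters such as '.' or '+'.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({"summary": ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {"col1": ["abc", "xyz"], "col2": ["nm"]}

for column, keywords in KEYWORDS_DICT.items():
    # Join the keywords into one alternation pattern, e.g. "abc|xyz".
    # re.escape is an assumption: it treats each keyword as literal text.
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    # str.contains returns one boolean per row; na=False treats missing summaries as no match
    row_hit = df["summary"].str.contains(pattern, regex=True, na=False)
    df[column] = np.where(row_hit, 1, -99)

print(df)
```

With this data, only the rows whose summary contains "abc" or "xyz" get a 1 in col1, and col2 stays at -99 everywhere.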

zishaf

You have to add .any() after your str.contains call so that each keyword check reduces to a single boolean; see the code below:

import numpy as np
import pandas as pd

# temp data
df = pd.DataFrame({'summary': ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {'col1': ["abc", "xyz"], "col2": ["nm"]}

# note the added .any()
for column in KEYWORDS_DICT:
    df[column] = np.where(any(df['summary'].str.contains(keyword).any() for keyword in KEYWORDS_DICT[column]), 1, -99)

Output:

{'summary': {0: 'abc', 1: 'qwe', 2: 'xyz'},
 'col1': {0: 1, 1: 1, 2: 1},
 'col2': {0: -99, 1: -99, 2: -99}}
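
Note that .any() collapses each keyword's mask to one boolean, so np.where receives a scalar condition and every row of a column gets the same value. In the output above, row 1 ('qwe') is flagged 1 for col1 even though it contains neither keyword. If a per-row flag is what you need, one possible sketch is to build a boolean mask per keyword and OR them together element-wise. The use of np.logical_or.reduce, regex=False, and na=False here are my assumptions, and the sketch expects every keyword list to be non-empty.

```python
import numpy as np
import pandas as pd

# Same temporary data as above
df = pd.DataFrame({"summary": ["abc", "qwe", "xyz"]})
KEYWORDS_DICT = {"col1": ["abc", "xyz"], "col2": ["nm"]}

for column, keywords in KEYWORDS_DICT.items():
    # One boolean array per keyword, each with one entry per row
    masks = [
        df["summary"].str.contains(keyword, regex=False, na=False).to_numpy()
        for keyword in keywords
    ]
    # Element-wise OR across the keyword masks: True where any keyword matches that row
    row_hit = np.logical_or.reduce(masks)
    df[column] = np.where(row_hit, 1, -99)

print(df.to_dict())
```

Here the condition passed to np.where has one entry per row, so rows are flagged independently of each other.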
Suraj Shourie