Search of a set of strings in a column containing strings in a pandas dataframe

Question

I have a df with multiple columns, one of which (the text column) holds strings containing many words.

I also have a set of words, S, that I need to look for.

I want to extract the rows of the df that contain at least one word from S in the text column. So far I have:

df_filtered = df[df['text'].str.contains('word')]

This works for one word from the set S. Instead of looping over S, is there a better way?

score 1 · Answer 1 · answered Mar 24 '23 at 13:25


IIUC, you can use | to represent "or" in a regex, so joining the words of S with | gives a pattern that matches any of them:

df_filtered = df[df['text'].str.contains('|'.join(S))]
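As a rough illustration of how this behaves, here is a small sketch on made-up data. The frame, its `id` column, and the contents of `S` are all hypothetical. It also passes `na=False`, an addition not in the answer above, so that missing values in the text column count as non-matches instead of breaking the boolean indexing.

```python
import pandas as pd

# Hypothetical example data; the real df and S will differ.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["the sun is out", "a lovely sunrise", "rain all day", None],
})
S = {"sun", "moon"}

# "sun|moon" (or "moon|sun"; set order does not matter for alternation).
pattern = "|".join(S)

# na=False treats missing text as "no match".
mask = df["text"].str.contains(pattern, na=False)
df_filtered = df[mask]
print(df_filtered)
```

Note that this is a substring match, so the "sunrise" row is selected along with the "sun" row. Words in S that contain regex metacharacters (for example `c++`) would also need escaping with `re.escape` before joining.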


Ynjxsjmh

Thank you @ynjxsjmh. How do I narrow this down if I want only whole-word matches? E.g., if S contains the word sun, df_filtered currently includes rows that contain sunrise as well, but I want only those rows with "sun" as a separate word. – rse Mar 24 '23 at 13:57
@rse You can try mozway's answer which uses word boundary. – Ynjxsjmh Mar 24 '23 at 14:30

score 0 · Accepted Answer · answered Mar 24 '23 at 13:26


If you want to match full words only, escape each word and wrap the alternation in word boundaries (\b):

import re

pattern = '|'.join(map(re.escape, S))

df_filtered = df[df['text'].str.contains(fr'\b(?:{pattern})\b')]
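To see the effect of the word boundaries, here is a minimal sketch on made-up data. The frame, its `id` column, and the contents of `S` are hypothetical, and `na=False` is an extra not present in the answer, added so that missing text is treated as a non-match.

```python
import re

import pandas as pd

# Hypothetical example data; the real df and S will differ.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["the sun is out", "a lovely sunrise", "full moon tonight", None],
})
S = {"sun", "moon"}

# Escape each word so regex metacharacters in S are matched literally.
pattern = "|".join(map(re.escape, S))

# \b on both sides requires a word boundary around the whole alternation;
# (?:...) is a non-capturing group.
mask = df["text"].str.contains(rf"\b(?:{pattern})\b", na=False)
whole_word = df[mask]
print(whole_word)
```

Here the "sunrise" row is no longer selected, because there is no word boundary between "sun" and "rise". The non-capturing group `(?:...)` matters: with a plain capturing group, `str.contains` emits a warning about match groups. If matching should ignore case, `case=False` can be passed to `str.contains` as well.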


mozway
