-2

I have a DataFrame df with multiple columns, one of which (text) holds strings of many words.

I also have a set of words, S, that I need to look for.

I want to extract the rows of df that contain at least one word from S in the text column. So far I have:

df_filtered = df[df['text'].str.contains('word')]

This works for one word from the set S. Instead of looping over S, is there a better way?

rse

2 Answers

1

If I understand correctly, you can use | to represent "or" in a regex:

df_filtered = df[df['text'].str.contains('|'.join(S))]
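
A minimal sketch of the line above on made-up data, for illustration only. The sample rows and the contents of S are assumptions. The sketch also assumes the words in S contain no regex metacharacters, since they are joined into the pattern unescaped. `na=False` is added so that missing text counts as "no match".

```python
import pandas as pd

# Hypothetical sample data; df, text and S follow the names used in the question.
df = pd.DataFrame({"text": ["the sun is out", "sunrise at six", "rainy day", None]})
S = {"sun", "moon"}

# Join the words with | so the regex matches any one of them.
# sorted() only makes the pattern deterministic; a set has no fixed order.
pattern = "|".join(sorted(S))

# na=False treats missing text as "no match", so the mask stays boolean.
df_filtered = df[df["text"].str.contains(pattern, na=False)]
print(df_filtered)
```

Note that this is a substring match: the "sunrise at six" row is kept because it contains "sun".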
Ynjxsjmh
  • Thank you @ynjxsjmh. How do I narrow this down if I want only whole-word matches? For example, if S contains the word sun, df_filtered currently includes rows that contain sunrise as well, but I want only rows that contain "sun" as a whole word. – rse Mar 24 '23 at 13:57
  • @rse You can try mozway's answer which uses word boundary. – Ynjxsjmh Mar 24 '23 at 14:30
0

If you want to match full words, use:

import re

pattern = '|'.join(map(re.escape, S))

df_filtered = df[df['text'].str.contains(fr'\b(?:{pattern})\b')]
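
To see what the escaping and the word boundaries each contribute, here is a small sketch that applies the same pattern to plain strings with the standard re module. The word set and the helper name has_whole_word are made up for illustration; "3.5" is included only because it contains a regex metacharacter.

```python
import re

# Hypothetical word set; "3.5" contains a regex metacharacter on purpose.
S = {"sun", "3.5"}

# Escape each word, then wrap the alternation in word boundaries.
pattern = "|".join(map(re.escape, sorted(S)))
whole_word = re.compile(rf"\b(?:{pattern})\b")

def has_whole_word(text: str) -> bool:
    """Return True if any word from S occurs in text as a whole word."""
    return whole_word.search(text) is not None

print(has_whole_word("the sun is out"))  # True
print(has_whole_word("sunrise at six"))  # False: "sun" is only a prefix here
print(has_whole_word("version 3.5"))     # True
print(has_whole_word("a 3x5 grid"))      # False: the escaped "." is literal
```

Without re.escape, the "." in "3.5" would match any character, so "3x5" would count as a hit. One caveat: \b only matches next to a word character, so words in S that begin or end with punctuation may not match as expected.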
mozway