10

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

Which is not scalable to thousands of words. However, when I do:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error: `first argument must be string or compiled pattern`.

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

Which doesn't raise an exception, but doesn't match all the words either.

How can I effectively use str.contains with a list of words?

cs95
  • 379,657
  • 97
  • 704
  • 746
sudonym
  • 3,788
  • 4
  • 36
  • 61
  • 1
    When you say "not scaleable", do you mean you would have a bunch of repetitive code? If so, use a loop: `for reference_word in reference_list:` – Galen Dec 22 '17 at 07:51
  • Have you tried [this](https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string) question? – Sohaib Farooqi Dec 22 '17 at 07:51
  • I'd first join the words and pass them to `str.contains`. – cs95 Dec 22 '17 at 07:51
  • 1
    Can you elaborate on this: `Which doesn't raise an exception but doesn't parse all words eather.`? Can you provide an example that shows that it doesn't work? Because it should. – cs95 Dec 22 '17 at 07:53
  • 1
    @sudonym if you are looking for speed over regex I suggest you go through FlashText https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f for 10000x speed – Bharath M Shetty Dec 22 '17 at 07:57
  • Also ensure that your first column is a column of strings. Use `df.iloc[:, 0] = df.iloc[:, 0].astype(str)` if you're not sure. – cs95 Dec 22 '17 at 07:59

1 Answer

19

For a scalable solution, do the following -

  1. join the contents of words with the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

To index the 0th column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]

Note: This will also work if words is a Series.
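For illustration, here is a minimal runnable sketch of the steps above. The sample rows are made up for demonstration; they are not from the question:

```python
import pandas as pd

# Hypothetical sample data: one column of sentences
df1 = pd.DataFrame({0: ["keep this line", "words appear here",
                        "nothing to see", "safe row"]})
words = ["words", "to", "remove"]

# Join the keywords into a single regex; \b ensures whole-word matches,
# and (?:...) is a non-capturing group, so pandas emits no group warning
pattern = r'\b(?:{})\b'.format('|'.join(words))
mask = df1.iloc[:, 0].str.contains(pattern)
df1 = df1[~mask]

print(df1[0].tolist())  # -> ['keep this line', 'safe row']
```

Note how "nothing to see" is dropped because "to" matches as a whole word, while "nothing" is untouched by the `\b` boundaries.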


Alternatively, if your 0th column is a column of words only (not sentences), then you can use df.isin, which should be faster -

df1 = df1[~df1.iloc[:, 0].isin(words)]
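For example (sample values are made up), with a column holding one word per cell, isin does exact whole-cell matching and no regex is needed:

```python
import pandas as pd

# Hypothetical column of single words
df1 = pd.DataFrame({0: ["keep", "words", "other", "remove"]})
words = ["words", "to", "remove"]

# isin compares each cell against the list as a whole value
df1 = df1[~df1.iloc[:, 0].isin(words)]

print(df1[0].tolist())  # -> ['keep', 'other']
```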
cs95
  • 379,657
  • 97
  • 704
  • 746