10

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

Which is not scalable to thousands of words. However, when I do:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error: `first argument must be string or compiled pattern`.

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

Which doesn't raise an exception, but doesn't match all the words either.

How can I effectively use str.contains with a list of words?

cs95
  • 379,657
  • 97
  • 704
  • 746
sudonym
  • 3,788
  • 4
  • 36
  • 61
  • 1
    When you say "not scaleable", do you mean you would have a bunch of repetitive code? If so, use a loop: `for reference_word in reference_list:` – Galen Dec 22 '17 at 07:51
  • Have you tried [this](https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string) question? – Sohaib Farooqi Dec 22 '17 at 07:51
  • I'd first join the words and pass them to `str.contains`. – cs95 Dec 22 '17 at 07:51
  • 1
    Can you elaborate on this: `Which doesn't raise an exception but doesn't parse all words eather.`? Can you provide an example that shows that it doesn't work? Because it should. – cs95 Dec 22 '17 at 07:53
  • 1
    @sudonym if you are looking for speed over regex I suggest you go through FlashText https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f for 10000x speed – Bharath M Shetty Dec 22 '17 at 07:57
  • Also ensure that your first column is a column of strings. Use `df.iloc[:, 0] = df.iloc[:, 0].astype(str)` if you're not sure. – cs95 Dec 22 '17 at 07:59

1 Answer

19

For a scalable solution, do the following -

  1. join the contents of words with the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

To index the 0th column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]

Note: This will also work if words is a Series.
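For illustration, here is a minimal runnable sketch of the steps above. The sample rows are made up for demonstration; they are not from the question:

```python
import pandas as pd

# Hypothetical sample data: one column of sentences
df1 = pd.DataFrame({0: ["keep this line", "words appear here",
                        "nothing to see", "safe row"]})
words = ["words", "to", "remove"]

# Join the keywords into a single regex; \b ensures whole-word matches,
# and (?:...) is a non-capturing group, so pandas emits no group warning
pattern = r'\b(?:{})\b'.format('|'.join(words))
mask = df1.iloc[:, 0].str.contains(pattern)
df1 = df1[~mask]

print(df1[0].tolist())  # -> ['keep this line', 'safe row']
```

Note how "nothing to see" is dropped because "to" matches as a whole word, while "nothing" is untouched by the `\b` boundaries.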


Alternatively, if your 0th column is a column of words only (not sentences), then you can use df.isin, which should be faster -

df1 = df1[~df1.iloc[:, 0].isin(words)]
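For example (sample values are made up), with a column holding one word per cell, isin does exact whole-cell matching and no regex is needed:

```python
import pandas as pd

# Hypothetical column of single words
df1 = pd.DataFrame({0: ["keep", "words", "other", "remove"]})
words = ["words", "to", "remove"]

# isin compares each cell against the list as a whole value
df1 = df1[~df1.iloc[:, 0].isin(words)]

print(df1[0].tolist())  # -> ['keep', 'other']
```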
cs95
  • 379,657
  • 97
  • 704
  • 746