Drop rows from Pandas data frame by matching a substring list

Question

I have a list of strings as follows:

drop_list = [ "www.stackoverflow", "www.youtube."]

I have a Pandas data frame df with a column, say column_name1, whose values may or may not contain substrings from drop_list. A sample df looks like this:

columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw

I want to drop every row of df whose value contains a substring from drop_list.

If I understand correctly, if drop_list held the exact values I wanted to match, I could use the following code, as per this SO question:

df[~df['column_name1'].isin(to_drop)]

Or, for a single value, I could use the str.contains method as suggested in this answer:

df[~df['column_name1'].str.contains("XYZ")]

Now, how do I combine both approaches to drop the rows? The desired output drops every row whose URL points to stackoverflow or youtube:

columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube

If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop as it is, it retains the stackoverflow URLs but deletes the youtube URLs.
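The behaviour described above can be reproduced without pandas. The following is a minimal sketch, assuming that `str.contains` applies the joined string as a regular expression to each value (its default); `re.search` stands in for that per-row check, and the two URLs are taken from the sample data.

```python
import re

to_drop = ["www.stackoverflow", "www.youtube."]
pattern = "|".join(to_drop)  # 'www.stackoverflow|www.youtube.'

urls = [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
]

# The stackoverflow URL has no "www" prefix, so the first alternative never matches it.
matches = [bool(re.search(pattern, u)) for u in urls]
print(matches)  # [False, True]

# The dots are unescaped, so each "." matches any character, not just a literal dot.
loose = bool(re.search(pattern, "wwwxyoutubex"))
print(loose)  # True
```

So the stackoverflow rows survive because their URLs simply do not contain "www.stackoverflow", not because `'|'.join()` treats the list as one pattern.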

If I change my list to the more generic to_drop = ["stackoverflow", "youtube"], it also deletes:

https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
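A small sketch of that over-matching, assuming the column is named col1 as in the code above and using a five-row subset of the sample data: because the generic names are searched for anywhere in the URL, a match in the path counts as much as a match in the host.

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
]})

to_drop = ["stackoverflow", "youtube"]

# True wherever either name occurs anywhere in the URL, including the path
mask = df["col1"].str.contains("|".join(to_drop))
kept = df[~mask]
print(kept["col1"].tolist())
# ['https://textblob.readthedocs.io/en/dev/']
```

Only one textblob row survives, although all three textblob rows should be kept.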

So all I am trying to do is delete every row whose URL is a stackoverflow or youtube URL. I would like to avoid the urlparse library.

Here is a MWE.

`.str.contains('|'.join(to_drop))` can be used to check against all elements in `to_drop` but I am not sure if that's what you are asking. A sample would be good in my opinion as well. — ayhan, Feb 25 '17 at 12:45
Added sample dataset. ```.str.contains('|'.join(to_drop))``` will not solve I believe. — kingmakerking, Feb 25 '17 at 12:50
@kingmakerking, why do you believe `.str.contains('|'.join(to_drop))` - will not solve the problem? — MaxU - stand with Ukraine, Feb 25 '17 at 12:53
@MaxU because it will try to match ```www.stackoverflow | www.youtube.``` as a whole pattern due to '|'.join()? I used it before, but for this case, at least in my df, it retained many rows containing stackoverflow or youtube. — kingmakerking, Feb 25 '17 at 12:59
@kingmakerking, so just "adapt" your `drop_list` correspondingly — MaxU - stand with Ukraine, Feb 25 '17 at 13:05
@MaxU but already the drop_list contains the substrings that I want to match with the column values! In other words, in my example, I want to drop all rows containing URLS from stackoverflow or youtube. Sorry, I am unable to understand your hint :( — kingmakerking, Feb 25 '17 at 13:17
@kingmakerking, why can't you cut off all prefixes (`www`, etc.) and/or suffixes from `drop_list` elements? — MaxU - stand with Ukraine, Feb 25 '17 at 13:19
In that case it will drop the rows containing a URL, say ```www,askmeanything.com/goodQuestionin_stackoverflow-question.html``` as well. — kingmakerking, Feb 25 '17 at 13:21
@kingmakerking, please post a __reproducible__ and desired data sets. This will help to avoid `Your suggestion does not work in that case!` situations... — MaxU - stand with Ukraine, Feb 25 '17 at 13:23
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/136618/discussion-between-kingmakerking-and-maxu). — kingmakerking, Feb 25 '17 at 14:31
@MaxU https://nbviewer.jupyter.org/gist/anonymous/74618ab59d2ab6f08a4dfa391894d50b and updated the questions. I hope it is clear now! — kingmakerking, Feb 25 '17 at 14:32

MaxU - stand with Ukraine · Answer 1 · 2017-02-25T15:25:30.867

Try this:

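import re

# Reduce each entry to a bare domain name: strip "www.", a trailing ".", and ".com"
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]
# Extract the host part of each URL, then keep only rows whose host matches none of the names
df[~df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
  .str.contains('|'.join(drp))]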

yields:

                                                                       col1
2  https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
3                                   https://textblob.readthedocs.io/en/dev/
4                          https://textblob.readthedocs.io/en/stackoverflow
5                                https://textblob.readthedocs.io/en/youtube

Explanation:

In [38]: drp
Out[38]: ['stackoverflow', 'youtube']

In [41]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
Out[41]:
0          stackoverflow.com
1          stackoverflow.com
2    textblob.readthedocs.io
3    textblob.readthedocs.io
4    textblob.readthedocs.io
5    textblob.readthedocs.io
6            www.youtube.com
7            www.youtube.com
8            www.youtube.com
Name: col1, dtype: object

In [42]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False).str.contains('|'.join(drp))
Out[42]:
0     True
1     True
2    False
3    False
4    False
5    False
6     True
7     True
8     True
Name: col1, dtype: bool
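For anyone who wants to run the whole thing end to end, here is a self-contained sketch of the same idea. It rebuilds the sample frame from the question (the column name col1 follows the answer's code). Two details are additions not present in the answer and should be read as assumptions: the names are passed through `re.escape` so that any regex metacharacters in them are matched literally, and `na=False` treats values that are not http(s) URLs as non-matches.

```python
import re
import pandas as pd

df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})
to_drop = ["www.stackoverflow", "www.youtube."]

# Same clean-up as in the answer: strip "www.", a trailing ".", and ".com"
drp = [re.sub(r"www\.|\.$|\.com", "", x) for x in to_drop]

# Assumption: escape the names so they are matched literally
pattern = "|".join(re.escape(x) for x in drp)

# Host part only, so names appearing in the path are ignored
host = df["col1"].str.extract(r"https?://([^/]*)", expand=False)

# Assumption: rows without an http(s) URL yield NaN and are kept (na=False)
mask = host.str.contains(pattern, na=False)
result = df[~mask]
print(result.index.tolist())  # [2, 3, 4, 5]
```

Note that this still matches the name anywhere in the host, so a host such as notstackoverflow.example would also be dropped; tighten the pattern if that matters for your data.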
