I have a list of strings as follows:
drop_list = [ "www.stackoverflow", "www.youtube."]
I have a Pandas data frame df with a column, say named column_name1, which may or may not contain substrings from drop_list. A sample df is as follows:
columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw
So, I want to drop all the rows from df that contain a substring from drop_list.
If I have understood correctly, if the list held the exact values that I want to match, I could have used the following code, as per this SO question:
df[~df['column_name1'].isin(to_drop)]
Or, if it is just one value, I could use the str.contains method as suggested in this answer:
df[~df['column_name1'].str.contains("XYZ")]
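To make the difference between the two calls concrete, here is a minimal sketch on a three-row frame. The frame contents and the names `exact`, `after_isin` and `after_contains` are assumptions for illustration only: `isin` compares whole cell values, while `str.contains` looks for a substring.

```python
import pandas as pd

# Small illustrative frame; the column name follows the question.
df = pd.DataFrame({"column_name1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://textblob.readthedocs.io/en/dev/",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
]})

# isin only matches when the whole cell equals one of the listed values.
exact = ["https://textblob.readthedocs.io/en/dev/"]
after_isin = df[~df["column_name1"].isin(exact)]

# str.contains matches a substring anywhere in the cell.
# regex=False treats the argument as a literal string, not a pattern.
after_contains = df[~df["column_name1"].str.contains("youtube", regex=False)]

print(after_isin)
print(after_contains)
```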
Now, how can I combine both approaches to drop the rows? The desired output would be to drop any rows that point to stackoverflow or youtube from my data frame:
columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop as it is, it retains the stackoverflow URLs but deletes the youtube URLs.
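The behaviour described above can be reproduced with the following sketch. It assumes the column is called column_name1 and that to_drop holds the same two strings as drop_list; the name `kept` is made up for illustration. The sample stackoverflow URLs have no "www." prefix, so "www.stackoverflow" never matches them.

```python
import pandas as pd

df = pd.DataFrame({"column_name1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

to_drop = ["www.stackoverflow", "www.youtube."]

# str.contains treats the joined string as a regular expression, so "|"
# means "or" and each "." matches any character, not only a literal dot.
pattern = "|".join(to_drop)
kept = df[~df["column_name1"].str.contains(pattern)]

print(kept)
```

If the dots should be matched literally, each entry could be passed through `re.escape` before joining.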
If I change my list to a more generic one, to_drop = ["stackoverflow", "youtube"], it also deletes the following rows, which I want to keep:
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
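A sketch of the generic list on the same sample data shows the over-matching. The column name column_name1 and the variable `kept` are assumptions; only the two textblob rows without either keyword survive.

```python
import pandas as pd

df = pd.DataFrame({"column_name1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

to_drop = ["stackoverflow", "youtube"]

# The keywords match anywhere in the URL, including the path, so the
# textblob rows ending in /stackoverflow and /youtube are dropped too.
kept = df[~df["column_name1"].str.contains("|".join(to_drop))]

print(kept)
```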
So, all I am trying to do is delete all rows containing stackoverflow and youtube URLs. I am avoiding the use of the urlparse library!
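One way the requirement might be expressed without urlparse is to anchor the pattern to the host part of each URL. This is only a sketch under stated assumptions: every URL starts with http:// or https://, an optional "www." may follow, and the names `domains`, `pattern` and `kept` are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({"column_name1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

# Hypothetical list of bare site names to drop (no "www." prefix).
domains = ["stackoverflow", "youtube"]

# Match only at the start of the URL: scheme, optional "www.", then one of
# the site names followed by a literal dot. Non-capturing groups (?:...)
# keep str.contains from warning about match groups.
alternatives = "|".join(re.escape(d) for d in domains)
pattern = rf"^https?://(?:www\.)?(?:{alternatives})\."

kept = df[~df["column_name1"].str.contains(pattern)]

print(kept)
```

Because the pattern is anchored with `^`, a keyword that appears later in the path, as in the textblob URLs, does not trigger a match.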
Here is a MWE.