I have a list of strings as follows:

drop_list = [ "www.stackoverflow", "www.youtube."]

I have a pandas DataFrame df with a column, say column_name1, whose values may or may not contain substrings from drop_list. A sample df is as follows:

columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw

I want to drop every row of df whose value contains a substring from drop_list.

If I have understood correctly, if drop_list held the exact values I want to match, I could use the following code, as per this SO question.

df[~df['column_name1'].isin(to_drop)]

Or, for a single value, I could use the str.contains method, as suggested in this answer:

df[~df['column_name1'].str.contains("XYZ")]
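To make the difference between the two snippets above concrete, here is a minimal sketch. The two-row frame is made up for illustration and is not part of the question's data; only the column name follows the snippets.

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
df = pd.DataFrame({"column_name1": [
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "www.youtube.",
]})

# isin: True only when the whole cell equals one of the listed values.
exact = df["column_name1"].isin(["www.youtube."])

# str.contains: True when the pattern occurs anywhere in the cell.
# regex=False treats the argument as a literal string, so "." is not a wildcard.
partial = df["column_name1"].str.contains("www.youtube.", regex=False)

print(exact.tolist())    # [False, True]
print(partial.tolist())  # [True, True]
```

So isin only helps when the list holds complete cell values, while str.contains does substring matching.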

Now, how do I combine both approaches to drop the rows? The desired output drops every row whose URL points to stackoverflow or youtube:

columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube

If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop set to the list above, it retains the stackoverflow URLs but deletes the youtube URLs.

If I make the list more generic, as follows: to_drop = ["stackoverflow", "youtube"]

it also deletes these rows, which should be kept:

https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
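Both behaviours described above can be reproduced with a short sketch on the sample data. The helper name drop_matching is hypothetical and only wraps the one-liner from the question; the column name col1 is taken from that one-liner.

```python
import pandas as pd

urls = [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]
df = pd.DataFrame({"col1": urls})


def drop_matching(frame, substrings):
    # Hypothetical helper: join the entries into one regex alternation,
    # exactly as the one-liner in the question does.
    return frame[~frame["col1"].str.contains("|".join(substrings))]


# The sample stackoverflow URLs have no "www.", so they are never matched.
kept_specific = drop_matching(df, ["www.stackoverflow", "www.youtube."])

# The bare names also match the textblob URLs that mention them in the path.
kept_generic = drop_matching(df, ["stackoverflow", "youtube"])

print(kept_specific.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(kept_generic.index.tolist())   # [2, 3]
```

The first list is too specific for the host names in the data, and the second is matched against the whole URL rather than only its host.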

So, all I am trying to do is delete every row that holds a stackoverflow or youtube URL. I am avoiding the use of the urlparse library!

Here is a MWE.

kingmakerking

1 Answer

Try this:

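import re

# Reduce each entry to a bare name: strip "www.", a trailing dot, and ".com".
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]
# Extract the host part of each URL, then keep rows whose host matches none of the names.
df[~df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
  .str.contains('|'.join(drp))]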

yields:

                                                                       col1
2  https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
3                                   https://textblob.readthedocs.io/en/dev/
4                          https://textblob.readthedocs.io/en/stackoverflow
5                                https://textblob.readthedocs.io/en/youtube

Explanation:

In [38]: drp
Out[38]: ['stackoverflow', 'youtube']

In [41]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
Out[41]:
0          stackoverflow.com
1          stackoverflow.com
2    textblob.readthedocs.io
3    textblob.readthedocs.io
4    textblob.readthedocs.io
5    textblob.readthedocs.io
6            www.youtube.com
7            www.youtube.com
8            www.youtube.com
Name: col1, dtype: object

In [42]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False).str.contains('|'.join(drp))
Out[42]:
0     True
1     True
2    False
3    False
4    False
5    False
6     True
7     True
8     True
Name: col1, dtype: bool
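For completeness, here is a self-contained sketch of the same idea that runs end to end on the sample data. It is a variation, not the exact code above: the simpler pattern https?://([^/]*), the re.escape call, and na=False are additions of this sketch, and the names host, pattern, and result are chosen here for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})
to_drop = ["www.stackoverflow", "www.youtube."]

# Same cleanup as above: strip "www.", a trailing dot, and ".com".
drp = [re.sub(r"www\.|\.$|\.com", "", x) for x in to_drop]

# Host part only; values that are not http(s) URLs become NaN.
host = df["col1"].str.extract(r"https?://([^/]*)", expand=False)

# re.escape guards against regex metacharacters in the names (an addition
# of this sketch); na=False keeps rows whose host could not be extracted.
pattern = "|".join(re.escape(name) for name in drp)
result = df[~host.str.contains(pattern, na=False)]

print(drp)                    # ['stackoverflow', 'youtube']
print(result.index.tolist())  # [2, 3, 4, 5]
```

Because the match runs against the host only, the textblob URLs that merely mention stackoverflow or youtube in their path are kept.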
MaxU - stand with Ukraine