
I have a dataframe:

ID   url   session
111   facebook.com   1
111   vk.com   1
111   stackoverflow.com   2
222   wsj.com  3
222   ria.ru   3
222   twitter.com   4
333   wikipedia.org   5
333   rt.com   5

I need to get all rows of a session if that session contains a valid url:

valid_urls = ['rt.com', 'wsj.com']

Desired output:

ID   url   session
222   wsj.com  3
222   ria.ru   3
333   wikipedia.org   5
333   rt.com   5

I know that I can filter using df.url.str.contains, but how can I add a condition on session?
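For reference, a minimal sketch (not part of the original post) that rebuilds the sample dataframe from the table above:

import pandas as pd

# sample data copied from the table in the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 222, 333, 333],
    'url': ['facebook.com', 'vk.com', 'stackoverflow.com', 'wsj.com',
            'ria.ru', 'twitter.com', 'wikipedia.org', 'rt.com'],
    'session': [1, 1, 2, 3, 3, 4, 5, 5]
})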

Petr Petrov

4 Answers

Use transform against each session to find the ones containing at least one valid url, then filter the dataframe with the resulting boolean series:

df[df.groupby('session')['url'].transform(lambda x : x.isin(valid_urls).any())]

    ID            url  session
3  222        wsj.com        3
4  222         ria.ru        3
6  333  wikipedia.org        5
7  333         rt.com        5
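An equivalent formulation, assuming the same df and valid_urls as above, uses groupby().filter(), which drops the non-matching session groups directly:

# keep only the session groups containing at least one valid url
df.groupby('session').filter(lambda g: g['url'].isin(valid_urls).any())

filter returns the surviving rows in one step, though it is often slower than a transform-based mask on frames with many groups.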
Zeugma

You can try this (adjust the session condition to your needs):

df = df[(df['url'].str.contains('|'.join(valid_urls))) & (df.session > 4)]
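Note that str.contains interprets the pattern as a regular expression, so the dots in the domains match any character. A safer variant, assuming the same valid_urls list, escapes the entries first:

import re

# escape regex metacharacters such as '.' before building the pattern
pattern = '|'.join(map(re.escape, valid_urls))
df = df[df['url'].str.contains(pattern) & (df.session > 4)]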
Mohamed AL ANI

Try this:

df = df[df['url'].isin(valid_urls)]

Using your data above and your valid url list (valid_urls = ['rt.com', 'wsj.com']), you can expect the filtered df to be:

ID   url   session
222   wsj.com  3
333   rt.com   5

If you need to add a second condition with the session, you can use the | (OR) or & (AND) operator as follows:

df = df[(df['url'].isin(valid_urls)) & (df['session'] > 2)]

This lets you filter by two conditions, joined either with OR or AND as you need.

EDIT: If you need to generate a list of valid_urls, you can do this step first:

from urllib.parse import urlparse  # 'from urlparse import urlparse' on Python 2

valid_urls = []
all_url = df['url'].tolist()
for url in all_url:
    # bare domains like 'wsj.com' have no scheme, so prepend '//'
    # so that urlparse places them in netloc
    parse_result = urlparse(url if '//' in url else '//' + url)
    if parse_result.netloc != "":
        valid_urls.append(url)

Note that this method won't check that the URLs are actually reachable in a browser, though. If you need to verify that, you can use the requests module to make an HTTP call and check the response code, as sketched below.
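A minimal sketch of such a check, assuming the requests package is installed and that bare domains should be tried over HTTPS:

import requests

def is_reachable(url, timeout=5):
    # bare domains like 'wsj.com' need a scheme before requests can fetch them
    try:
        response = requests.head('https://' + url, timeout=timeout,
                                 allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False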

Qichao Zhao
  • but my file is bigger. I need a condition to get all urls from sessions where there are valid urls. – Petr Petrov Nov 26 '16 at 12:42
  • @PetrPetrov so it seems to me the real problem is that you need to define a list of valid URLs. What you can do is extract a list of all urls by using `all_url = df['url'].tolist()` and then loop through and validate each one by using urlparse (see: https://stackoverflow.com/questions/22238090/validating-urls-in-python). I'll update my answer with an example too. – Qichao Zhao Nov 26 '16 at 15:02

I think you can use isin: first find all IDs and sessions with valid urls in a new DataFrame called same, then merge with an inner join. If you need to check substrings, use str.contains instead:

valid_urls = ['rt.com', 'wsj.com']
same = df.loc[df.url.isin(valid_urls), ['ID', 'session']]
#same = df.loc[df.url.str.contains('|'.join(valid_urls)), ['ID', 'session']]
print (same)
    ID  session
3  222        3
7  333        5

print (pd.merge(df, same))
    ID            url  session
0  222        wsj.com        3
1  222         ria.ru        3
2  333  wikipedia.org        5
3  333         rt.com        5
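If you prefer to be explicit about the join keys, assuming the same df and same as above, you can pass them to merge; the result is identical here because ID and session are the only shared columns:

print (pd.merge(df, same, on=['ID', 'session']))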
jezrael