
I have a dataframe:

ID   url   session
111   facebook.com   1
111   vk.com   1
111   stackoverflow.com   2
222   wsj.com  3
222   ria.ru   3
222   twitter.com   4
333   wikipedia.org   5
333   rt.com   5

I need to get all rows of a session if that session contains a valid url:

valid_urls = ['rt.com', 'wsj.com']

Desired output:

ID   url   session
222   wsj.com  3
222   ria.ru   3
333   wikipedia.org   5
333   rt.com   5

I know that I can filter using df.url.str.contains, but how can I add a condition on session?
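For reference, a minimal sketch (not part of the original post) that rebuilds the sample dataframe from the table above:

import pandas as pd

# sample data copied from the table in the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 222, 333, 333],
    'url': ['facebook.com', 'vk.com', 'stackoverflow.com', 'wsj.com',
            'ria.ru', 'twitter.com', 'wikipedia.org', 'rt.com'],
    'session': [1, 1, 2, 3, 3, 4, 5, 5]
})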

Petr Petrov

4 Answers

Use transform against each session to find the ones containing at least one valid url, then filter the dataframe with the resulting boolean series:

df[df.groupby('session')['url'].transform(lambda x : x.isin(valid_urls).any())]

    ID            url  session
3  222        wsj.com        3
4  222         ria.ru        3
6  333  wikipedia.org        5
7  333         rt.com        5
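An equivalent formulation, assuming the same df and valid_urls as above, uses groupby().filter(), which drops the non-matching session groups directly:

# keep only the session groups containing at least one valid url
df.groupby('session').filter(lambda g: g['url'].isin(valid_urls).any())

filter returns the surviving rows in one step, though it is often slower than a transform-based mask on frames with many groups.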
Zeugma

You can try this (adjust the session condition to your needs):

df = df[(df['url'].str.contains('|'.join(valid_urls))) & (df.session > 4)]
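Note that str.contains interprets the pattern as a regular expression, so the dots in the domains match any character. A safer variant, assuming the same valid_urls list, escapes the entries first:

import re

# escape regex metacharacters such as '.' before building the pattern
pattern = '|'.join(map(re.escape, valid_urls))
df = df[df['url'].str.contains(pattern) & (df.session > 4)]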
Mohamed AL ANI

Try this:

df = df[df['url'].isin(valid_urls)]

Using your data above and your valid url list (valid_urls = ['rt.com', 'wsj.com']), you can expect the filtered df to be:

ID   url   session
222   wsj.com  3
333   rt.com   5

If you need to add a second condition with the session, you can use the | (OR) or & (AND) operator as follows:

df = df[(df['url'].isin(valid_urls)) & (df['session'] > 2)]

This lets you filter by two conditions, joined either with OR or AND as you need.

EDIT: If you need to generate a list of valid_urls, you can do this step first:

from urllib.parse import urlparse  # 'from urlparse import urlparse' on Python 2

valid_urls = []
all_url = df['url'].tolist()
for url in all_url:
    # bare domains like 'wsj.com' have no scheme, so prepend '//'
    # so that urlparse places them in netloc
    parse_result = urlparse(url if '//' in url else '//' + url)
    if parse_result.netloc != "":
        valid_urls.append(url)

Note that this method won't check that the URLs are actually reachable in a browser, though. If you need to verify that, you can use the requests module to make an HTTP call and check the response code, as sketched below.
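A minimal sketch of such a check, assuming the requests package is installed and that bare domains should be tried over HTTPS:

import requests

def is_reachable(url, timeout=5):
    # bare domains like 'wsj.com' need a scheme before requests can fetch them
    try:
        response = requests.head('https://' + url, timeout=timeout,
                                 allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False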

Qichao Zhao
  • but my file is bigger. I need a condition to get all urls from sessions where there are valid urls. – Petr Petrov Nov 26 '16 at 12:42
  • @PetrPetrov so it seems to me the real problem is that you need to define a list of valid URLs. What you can do is extract a list of all urls by using `all_url = df['url'].tolist()` and then loop through and validate each one by using urlparse (see: https://stackoverflow.com/questions/22238090/validating-urls-in-python). I'll update my answer with an example too. – Qichao Zhao Nov 26 '16 at 15:02

I think you can use isin: first find all IDs and sessions with valid urls in a new DataFrame called same, then merge with an inner join. If you need to check substrings, use str.contains instead:

valid_urls = ['rt.com', 'wsj.com']
same = df.loc[df.url.isin(valid_urls), ['ID', 'session']]
#same = df.loc[df.url.str.contains('|'.join(valid_urls)), ['ID', 'session']]
print (same)
    ID  session
3  222        3
7  333        5

print (pd.merge(df, same))
    ID            url  session
0  222        wsj.com        3
1  222         ria.ru        3
2  333  wikipedia.org        5
3  333         rt.com        5
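If you prefer to be explicit about the join keys, assuming the same df and same as above, you can pass them to merge; the result is identical here because ID and session are the only shared columns:

print (pd.merge(df, same, on=['ID', 'session']))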
jezrael