Split from a specific delimiter

Question

How do I rip a URL like http://www.facebook.com/pages/create.php to get a result like this: www.facebook.com?

I tried this, but it doesn't work:

line.split('/', 2)[2]

My problem is probably with the two forward slashes (//), and some of the URLs start with the www string instead.

Thanks for your help, Adia

possible duplicate of [How to split a web address](http://stackoverflow.com/questions/286150/how-to-split-a-web-address) — SilentGhost, Jan 19 '11 at 14:19
Not quite a duplicate, we should address how to handle the missing 'http://' for the URLs that 'start from the www string'. Just using urlparse doesn't cover that. — PaulMcG, Jan 19 '11 at 14:25
possible duplicate of [Slicing URL with Python](http://stackoverflow.com/questions/258746/slicing-url-with-python) — tzot, Feb 13 '11 at 11:45

score 8 · Accepted Answer · answered Jan 19 '11 at 14:15

You might want to look at Python's urlparse module (urllib.parse in Python 3).

>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> o = urlparse('http://www.facebook.com/pages/create.php')
>>> o.netloc
'www.facebook.com'
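
For the URLs that start directly with www and carry no 'http://', urlparse leaves netloc empty and puts the whole string in path. One possible workaround is sketched below, assuming Python 3's urllib.parse. The helper name host_of and the scheme pattern are illustrative assumptions, not part of the library: the idea is to prepend '//' when no scheme is present so that urlparse treats what follows as the network location.

```python
import re
from urllib.parse import urlparse

# Assumption: a "scheme" is letters/digits/+/./- followed by '://'.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")

def host_of(url):
    """Hypothetical helper: return the host part, tolerating a missing scheme."""
    if not _SCHEME.match(url) and not url.startswith("//"):
        # urlparse only recognises a netloc when it is introduced by '//'.
        url = "//" + url
    return urlparse(url).netloc

print(host_of("http://www.facebook.com/pages/create.php"))  # www.facebook.com
print(host_of("www.facebook.com/pages/create.php"))         # www.facebook.com
```

This is only a sketch; inputs such as relative paths or strings with embedded credentials would need extra handling.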

grifaton

Yes, it is better to use appropriate tools for common tasks. – eumiro Jan 19 '11 at 14:17
Note that some of the URLs 'start with the www string'. If the leading 'http://' is missing, urlparse fails to parse this. – PaulMcG Jan 19 '11 at 14:26
@Paul McGuire: How do I vote on a comment? I want to upvote yours. – eyquem Jan 19 '11 at 17:13
@Adia: « How to rip a URL LIKE http://www.facebook.com/pages/create.php » and « Yes, actually some of the URLs don't have the http:// » are contradictory. So grifaton gave an exact answer to your question but a wrong answer to your problem. I won't downvote anybody, though. – eyquem Jan 19 '11 at 17:22
@eyquem: Sorry if I confused anyone. The facebook URL was just an example, and there are more URLs in the file I am handling, with all kinds of domains and structures. Anyway, from all the posts, I now know how to go about the problem. Thanks everyone. – Adia Jan 20 '11 at 09:26

score 1 · Answer 2 · answered Jan 19 '11 at 15:53

Probably the best bet would be to extract the server part with a regex, i.e.,

\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/

That can cover www.facebook.com, facebook.com, some-domain.tv, www.some-domain.net, etc.

NOTE: the leading and trailing slashes are part of the regex itself, not regex delimiters.
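
A minimal sketch of how this pattern could be used from Python's re module. The helper name host_by_regex is hypothetical. Because the pattern requires a literal slash on both sides of the host, a line that starts directly with www (no 'http://') gets no match here, and the {2,3} part only accepts two- or three-letter lowercase endings.

```python
import re

# The pattern exactly as posted; the slashes are literal characters to match.
HOST_RE = re.compile(r"\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/")

def host_by_regex(line):
    """Hypothetical helper: return the matched host, or None if nothing matches."""
    m = HOST_RE.search(line)
    # Strip the surrounding slashes that are part of the match.
    return m.group(0).strip("/") if m else None

print(host_by_regex("http://www.facebook.com/pages/create.php"))  # www.facebook.com
print(host_by_regex("www.facebook.com/pages/create.php"))         # None
```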

score 1 · Answer 3 · answered Jan 19 '11 at 16:27

Try:

line.split("//", 1)[-1].split("/", 1)[0]
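
To spell out what the one-liner does, here is the same logic split into two steps. The function name host_from_line is only illustrative. The [-1] index is what makes it work when there is no '//' at all: split then returns a single-element list, and its last element is the whole line.

```python
def host_from_line(line):
    """Illustrative wrapper around the one-liner above."""
    # Step 1: drop everything up to and including the first '//', if any.
    after_scheme = line.split("//", 1)[-1]
    # Step 2: keep only what comes before the first remaining '/'.
    return after_scheme.split("/", 1)[0]

print(host_from_line("http://www.facebook.com/pages/create.php"))  # www.facebook.com
print(host_from_line("www.facebook.com/pages/create.php"))         # www.facebook.com
```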

erbridge

eyquem · Answer 4 · 2011-01-19T16:55:40.880

0

I would do:

ch[7 if ch[0:7]=='http://' else 0:].partition('/')[0]

I’m not sure it’s valid for all the cases you’ll encounter.

Also:

ch[(ch[0:7]=='http://')*7:].partition('/')[0]
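
Both expressions written out as small functions (the names are illustrative only). The second form relies on True and False behaving as 1 and 0 in arithmetic. As a concrete case where neither is valid: the check only recognises the exact prefix 'http://', so an 'https://' URL is not stripped and yields 'https:'.

```python
def host_conditional(ch):
    # Skip the first 7 characters only when they are exactly 'http://'.
    start = 7 if ch[0:7] == 'http://' else 0
    return ch[start:].partition('/')[0]

def host_bool_trick(ch):
    # True * 7 == 7 and False * 7 == 0, so the comparison doubles as the offset.
    return ch[(ch[0:7] == 'http://') * 7:].partition('/')[0]

for url in ("http://www.facebook.com/pages/create.php",
            "www.facebook.com/pages/create.php"):
    print(host_conditional(url), host_bool_trick(url))
```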

edited Jan 19 '11 at 16:55

answered Jan 19 '11 at 16:44

eyquem
