2

How can I strip a URL like http://www.facebook.com/pages/create.php down to just the host, www.facebook.com?

I tried this, but it doesn't work:

line.split('/', 2)[2]

My problem is probably with the two forward slashes (//), and some of the URLs start directly with www, without the http:// prefix.

Thanks for your help, Adia

SilentGhost
Adia
  • possible duplicate of [How to split a web address](http://stackoverflow.com/questions/286150/how-to-split-a-web-address) – SilentGhost Jan 19 '11 at 14:19
  • Not quite a duplicate; we should address how to handle the missing 'http://' for the URLs that 'start from the www string'. Just using urlparse doesn't cover that. – PaulMcG Jan 19 '11 at 14:25
  • possible duplicate of [Slicing URL with Python](http://stackoverflow.com/questions/258746/slicing-url-with-python) – tzot Feb 13 '11 at 11:45

4 Answers

8

You might want to look at Python's urlparse module.

>>> from urlparse import urlparse  # Python 2; in Python 3 use: from urllib.parse import urlparse
>>> o = urlparse('http://www.facebook.com/pages/create.php')
>>> o.netloc
'www.facebook.com'
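
Since the question mentions that some URLs start directly with www, here is a minimal sketch of one way to cope with a missing scheme. It assumes Python 3 (where the function lives in urllib.parse), and get_host is a hypothetical helper name, not part of the standard library. The idea is that urlparse only fills in netloc when the string contains "//", so the sketch prepends it when no scheme is present:

```python
from urllib.parse import urlparse

def get_host(url):
    # Hypothetical helper: return the host part of a URL,
    # whether or not it carries a scheme such as http://.
    # Assumption: "://" appears in the string only as part of the scheme.
    if "://" not in url and not url.startswith("//"):
        # Without a leading "//", urlparse puts the whole string in .path
        # and leaves .netloc empty, so add it.
        url = "//" + url
    return urlparse(url).netloc

print(get_host("http://www.facebook.com/pages/create.php"))
print(get_host("www.facebook.com/pages/create.php"))
```

Both calls should print www.facebook.com. If a port or credentials may be present, the hostname attribute of the parse result may be a better fit than netloc.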
grifaton
  • Yes, it is better to use appropriate tools for common tasks. – eumiro Jan 19 '11 at 14:17
  • 4
    Note that some of the URLs 'start with the www string'. If the leading 'http://' is missing, urlparse fails to parse this. – PaulMcG Jan 19 '11 at 14:26
  • @Paul McGuire: How do I vote on a comment? I want to upvote yours. – eyquem Jan 19 '11 at 17:13
  • 1
    @Adia: "How to rip a URL LIKE http://www.facebook.com/pages/create.php" and "Yes, actually some of the URLs don't have the http://" are contradictory. So grifaton gave an exact answer to your question and a wrong answer to your problem. But I won't downvote anybody. – eyquem Jan 19 '11 at 17:22
  • @eyquem: Sorry if I confused anyone. The Facebook URL was just an example, and there are more URLs in the file I am handling, with all kinds of domains and structures. Anyway, from all the posts, I now know how to go about the problem. Thanks, everyone. – Adia Jan 20 '11 at 09:26
1

Probably the best bet would be to extract the server part with a regex, i.e.,

\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/

That can cover www.facebook.com, facebook.com, some-domain.tv, www.some-domain.net, etc.

NOTE: the leading and trailing slashes are part of the regex, not regex delimiters.
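
To make this concrete, here is a rough sketch of how the pattern might be used from Python's re module. The capture group around the host and the helper name host_from_regex are additions for illustration, not part of the pattern above. Because the pattern requires a slash on both sides of the host, it finds nothing when the URL starts directly with www, and the {2,3} limit would also reject longer top-level domains such as .info:

```python
import re

# The pattern from above, with a capture group added around the host
# (the group is an assumption made here for convenience).
HOST_RE = re.compile(r"/([a-z0-9\-.]*[a-zA-Z0-9\-]+\.[a-z]{2,3})/")

def host_from_regex(url):
    # Hypothetical helper: return the captured host, or None if no match.
    match = HOST_RE.search(url)
    return match.group(1) if match else None

print(host_from_regex("http://www.facebook.com/pages/create.php"))
# No slash precedes the host here, so the pattern does not match:
print(host_from_regex("www.facebook.com/pages/create.php"))
```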

mguillech
1

Try:

line.split("//", 1)[-1].split("/", 1)[0]
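
A quick sketch of why this works with and without the scheme: split("//", 1)[-1] keeps what follows the first "//", or the whole string when there is no "//", and the second split keeps what precedes the next "/". The function name host_by_split and the sample URLs are only for illustration:

```python
def host_by_split(line):
    # Drop everything up to and including the first "//" (if present),
    # then keep only the text before the next "/".
    return line.split("//", 1)[-1].split("/", 1)[0]

for url in ("http://www.facebook.com/pages/create.php",
            "www.facebook.com/pages/create.php",
            "https://some-domain.tv"):
    print(url, "->", host_by_split(url))
```

One caveat: a URL without a scheme but with "//" somewhere in its path would be cut at the wrong place.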
erbridge
0

I would do:

ch[7 if ch[0:7]=='http://' else 0:].partition('/')[0]

I’m not sure it’s valid for all the cases you’ll encounter

Also:

ch[(ch[0:7]=='http://')*7:].partition('/')[0]
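
Both one-liners only recognise the literal prefix 'http://'. As a possible variation (not something tested against the asker's data), the same idea can be written with str.startswith so that 'https://' is handled too; host_by_prefix and the prefix tuple are hypothetical names chosen for this sketch:

```python
def host_by_prefix(ch, prefixes=("http://", "https://")):
    # Strip the first matching scheme prefix, if any.
    for prefix in prefixes:
        if ch.startswith(prefix):
            ch = ch[len(prefix):]
            break
    # partition('/') returns (head, sep, tail); head is the host part.
    return ch.partition("/")[0]

print(host_by_prefix("http://www.facebook.com/pages/create.php"))
print(host_by_prefix("https://www.facebook.com/pages/create.php"))
print(host_by_prefix("www.facebook.com/pages/create.php"))
```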
eyquem