
The post "Get domain name from URL" suggested several libraries for getting the top-level domain, but how else can I extract the domain name from a URL without any additional library?

I tried a regex. It seems to work, but I am sure there are better ways of doing it, and plenty of URLs will break the regex:

>>> import re
>>> url = "https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt"
>>> domain = re.sub(r"^(https?://)?(www\.)?", "", url).split('/')[0]
>>> domain
'stackoverflow.com'
>>> url = "www.apple.com/itune"
>>> re.sub(r"^(https?://)?(www\.)?", "", url).split('/')[0]
'apple.com'

I've also tried urlparse, but for a URL without a scheme the hostname comes back as None:

>>> from urlparse import urlparse
>>> url ='https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
>>> url = 'www.apple.com/itune'
>>> urlparse(url).hostname
>>> 
alvas
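For what it's worth, the None result seems to come from how urlparse splits its input: without a scheme or a leading `//`, the whole string is treated as a path, so there is no network location to read a hostname from. A minimal sketch, assuming Python 3 (where urlparse lives in urllib.parse); `show_parts` is just an illustrative helper name, not something from the question:

```python
from urllib.parse import urlparse

def show_parts(url):
    """Return (scheme, netloc, path, hostname) as urlparse sees them."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path, parts.hostname

# No scheme and no leading "//": everything lands in the path.
print(show_parts("www.apple.com/itune"))
# A leading "//" marks the start of the network location.
print(show_parts("//www.apple.com/itune"))
```

In the first call the netloc is empty and the hostname is None; in the second the netloc is filled in and the hostname is available.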

2 Answers


How about writing a function that wraps urlparse?

>>> from urlparse import urlparse
>>>
>>> def extract_hostname(url):
...     components = urlparse(url)
...     if not components.scheme:
...         components = urlparse('http://' + url)
...     return components.netloc
...
>>> extract_hostname('http://stackoverflow.com/questions/22143342')
'stackoverflow.com'
>>> extract_hostname('www.apple.com/itune')
'www.apple.com'
>>> extract_hostname('file:///usr/bin/python')
''
falsetru
  • Checking the components' scheme sounds good =). In fact, there are many more cases where `urlparse(url).hostname` comes back as None; I guess I'll have to catch whatever I can and just discard the None ones... – alvas Mar 03 '14 at 11:11
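The "catch what you can, discard the None ones" idea from the comment could be sketched roughly as follows. This assumes Python 3 (urllib.parse instead of the urlparse module) and returns `.hostname` rather than `.netloc`, so a missing host shows up as None; the `default_scheme` parameter is a hypothetical addition, not part of the answer above:

```python
from urllib.parse import urlparse

def extract_hostname(url, default_scheme="http"):
    """Return the hostname of url, or None if none can be found.

    default_scheme is a hypothetical parameter: it is prepended only
    when the input has no scheme at all.
    """
    parts = urlparse(url)
    if not parts.scheme:
        parts = urlparse(default_scheme + "://" + url)
    # .hostname is lowercased and excludes any port or user info.
    return parts.hostname

urls = [
    "http://stackoverflow.com/questions/22143342",
    "www.apple.com/itune",
    "file:///usr/bin/python",  # has a scheme but no host
]
# Keep only the inputs that yielded a hostname.
hostnames = [h for h in map(extract_hostname, urls) if h]
print(hostnames)
```

Inputs of the form host:port/path with no scheme may not be handled by this check, since the part before the colon can be read as a scheme, so those would probably need separate treatment.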

Use the urllib.parse module from the standard library (Python 3; in Python 2 the same function lives in the urlparse module):

>>> from urllib.parse import urlparse
>>> url = 'http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
pbacterio