How else can I strip a domain name from webpage with no additional library - python?

Question

The post "Get domain name from URL" suggested multiple libraries for getting the top-level domain, but how else can I strip the domain name from a webpage URL with no additional library?

I tried a regex. It seems to work, but I am sure there are better ways of doing it, and plenty of URLs will break the regex:

>>> import re
>>> url = "https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt"
>>> domain = re.sub(r"(https?://|www\.)", "", url).split('/')[0]
>>> domain
'stackoverflow.com'
>>> url = "www.apple.com/itune"
>>> re.sub(r"(https?://|www\.)", "", url).split('/')[0]
'apple.com'

I've also tried urlparse, but for some URLs it ends up with None:

>>> from urlparse import urlparse
>>> url ='https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
>>> url = 'www.apple.com/itune'
>>> urlparse(url).hostname
>>>

[`urlparse`](http://docs.python.org/2/library/urlparse.html) is not an external library. Why don't you use it? — falsetru, Mar 03 '14 at 09:54
Still checking whether my URLs break `urlparse`. I would have used a regex if possible, to see how much I could hasten the process... it's a long list, lol... – alvas, Mar 03 '14 at 09:58

score 2 · Accepted Answer · answered Mar 03 '14 at 10:41

How about making a function that wraps urlparse?

>>> from urlparse import urlparse
>>>
>>> def extract_hostname(url):
...     components = urlparse(url)
...     if not components.scheme:
...         components = urlparse('http://' + url)
...     return components.netloc
...
>>> extract_hostname('http://stackoverflow.com/questions/22143342')
'stackoverflow.com'
>>> extract_hostname('www.apple.com/itune')
'www.apple.com'
>>> extract_hostname('file:///usr/bin/python')
''
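
For Python 3, the same idea might look like the sketch below. It is not part of the original answer: it imports from urllib.parse, returns `hostname` rather than `netloc` (so any port or credentials are dropped and the result is lowercased), and the `strip_www` parameter is a hypothetical addition for callers who want 'apple.com' instead of 'www.apple.com'.

```python
from urllib.parse import urlparse


def extract_hostname(url, strip_www=False):
    """Return the host of url, or '' when none can be found."""
    components = urlparse(url)
    if not components.scheme:
        # urlparse only recognises a host after '//', so supply a scheme.
        components = urlparse('http://' + url)
    host = components.hostname or ''
    # strip_www is a hypothetical convenience flag, not part of urlparse.
    if strip_www and host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(extract_hostname('http://stackoverflow.com/questions/22143342'))  # stackoverflow.com
print(extract_hostname('www.apple.com/itune'))                          # www.apple.com
print(extract_hostname('www.apple.com/itune', strip_www=True))          # apple.com
print(repr(extract_hostname('file:///usr/bin/python')))                 # ''
```

Scheme-less inputs that contain a port, such as 'example.com:8080/path', are worth testing separately, since the text before the colon may be read as a scheme.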

falsetru

Checking the components' scheme sounds good =). In fact, there are many more instances of `urlparse(url).hostname` returning None. I guess I'll have to catch whatever I can and just throw out the None ones... – alvas Mar 03 '14 at 11:11

score 0 · Answer 2 · answered Mar 03 '14 at 10:01

Use the urllib.parse module from the standard library.

>>> from urllib.parse import urlparse
>>> url = 'http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
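
The example above uses a full URL. As a small illustration of the scheme-less case from the question (the variable names `bare` and `relative` are just for this sketch), Python 3 behaves the same way: urlparse treats text as a network location only when it follows '//', so a bare 'www.apple.com/itune' is parsed entirely as a path.

```python
from urllib.parse import urlparse

# Without a scheme or a leading '//', the whole string is parsed as a path.
bare = urlparse('www.apple.com/itune')
print(repr(bare.netloc))   # ''
print(bare.path)           # www.apple.com/itune
print(bare.hostname)       # None

# A leading '//' marks the start of the network location.
relative = urlparse('//www.apple.com/itune')
print(relative.hostname)   # www.apple.com
print(relative.path)       # /itune
```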

pbacterio

Isn't it `from urlparse import urlparse`? – alvas Mar 03 '14 at 10:18
For Python 2, yes. My example is for Python 3. – pbacterio Mar 03 '14 at 10:22
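
If the same script has to run under both versions discussed in these comments, one common pattern is to try the Python 3 import first and fall back to the Python 2 module. This is a sketch of that pattern, not something either answer proposed:

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse      # Python 2 location

url = 'http://stackoverflow.com/questions/22143342'
print(urlparse(url).hostname)  # stackoverflow.com
```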
