I have a URL like:
http://abc.hostname.com/somethings/anything/
I want to get:
hostname.com
What module can I use to accomplish this?
I want to use the same module and method in Python 2.
For parsing the domain of a URL in Python 3, you can use:
from urllib.parse import urlparse
domain = urlparse('http://www.example.test/foo/bar').netloc
print(domain) # --> www.example.test
However, to reliably extract the registered domain (example.test in this example, as opposed to the top-level domain test), you need to install a specialized library (e.g., tldextract).
Instead of a regex or a hand-written solution, you can use Python's urlparse:
from urllib.parse import urlparse
print(urlparse('http://abc.hostname.com/somethings/anything/'))
>> ParseResult(scheme='http', netloc='abc.hostname.com', path='/somethings/anything/', params='', query='', fragment='')
print(urlparse('http://abc.hostname.com/somethings/anything/').netloc)
>> abc.hostname.com
To get the domain without the subdomain:
t = urlparse('http://abc.hostname.com/somethings/anything/').netloc
print('.'.join(t.split('.')[-2:]))
>> hostname.com
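One caveat with keeping only the last two labels: it assumes the public suffix is a single label. The sketch below wraps the same idea in a helper (the name last_two_labels is made up for this illustration) and shows what happens with a two-label suffix such as co.uk. It uses .hostname rather than .netloc so that a port number, if present, is dropped.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host."""
    # .hostname is the lower-cased host without port or credentials;
    # it is None when the URL has no network location.
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(last_two_labels("http://abc.hostname.com/somethings/anything/"))  # hostname.com
print(last_two_labels("http://www.example.co.uk/"))  # co.uk, not example.co.uk
```

If suffixes like co.uk matter for your data, a library that consults the Public Suffix List is the safer choice.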
You can use tldextract.
Example code:
from tldextract import extract
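# Attribute access is used here instead of tuple unpacking, which newer tldextract releases may not support.
result = extract("http://abc.hostname.com/somethings/anything/")  # subdomain 'abc', domain 'hostname', suffix 'com'
url = result.domain + '.' + result.suffix  # hostname.com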
print(url)
Assuming you have the URL in a string (my_string below), and assuming you want to stay generic about how many subdomain levels sit in front of the domain, you could:
token=my_string.split('http://')[1].split('/')[0]
top_level=token.split('.')[-2]+'.'+token.split('.')[-1]
We first split on http:// to remove it from the string. Then we split on / to drop every directory or sub-directory part. Finally, [-2] takes the second-to-last token after splitting on . and we append the last token to it, which gives the domain without its subdomains.
There are probably more graceful and robust ways to do this (for example, if your website is http://.com it will break), but it's a start :)
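As one possible way to harden the splitting approach, here is a minimal sketch that lets urlparse find the host instead of splitting on http://. The function name naive_domain and the choice to return None for unusable input are assumptions made for this example. It still keeps only the last two labels, so multi-part suffixes such as co.uk are not handled.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Best-effort last two labels of a URL's host, or None if unavailable."""
    # urlparse only recognises the host when a scheme or a leading '//' is
    # present, so prepend '//' for bare inputs like 'abc.hostname.com/path'.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname  # lower-cased, without port or credentials
    if not host:
        return None
    # Drop empty labels so a host like '.com' does not count as two labels.
    labels = [label for label in host.split(".") if label]
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


print(naive_domain("https://abc.hostname.com:8080/somethings/"))  # hostname.com
print(naive_domain("abc.hostname.com/somethings/anything/"))      # hostname.com
print(naive_domain("http://.com"))                                # None
```

Returning None instead of raising keeps the caller in charge of deciding what a malformed URL should mean.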
In Python 2, try:
from urlparse import urlparse
parsed = urlparse('http://abc.hostname.com/somethings/anything/')
domain = parsed.netloc.split(".")[-2:]
host = ".".join(domain)
print host  # prints hostname.com
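Since the question asks for the same module and method on both Python versions, here is a small sketch of a fallback import: the function is the same, only its location differs (urllib.parse in Python 3, the urlparse module in Python 2). The variable names follow the snippet above.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

parsed = urlparse("http://abc.hostname.com/somethings/anything/")
# Keep the last two dot-separated labels of the network location.
host = ".".join(parsed.netloc.split(".")[-2:])
print(host)  # hostname.com
```

Calling print with a single parenthesised argument behaves the same in both versions, so the snippet needs no further changes.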