Extract domain from URL in python

Question

I have a URL like:
http://abc.hostname.com/somethings/anything/

I want to get:
hostname.com

What module can I use to accomplish this?
Ideally the same module and method should also work in Python 2.

`url.split('/')[2]` will give you 'abc.hostname.com'; from there you can extract the domain with `split`, `re`, or any other method. – Gahan, May 22 '17 at 12:58

score 125 · Answer 1 · answered Jun 06 '19 at 11:18

For parsing the domain of a URL in Python 3, you can use:

from urllib.parse import urlparse

domain = urlparse('http://www.example.test/foo/bar').netloc
print(domain) # --> www.example.test

However, to reliably get the registered domain (example.test in this example, i.e. the host without the www subdomain), you need to install a specialized library (e.g., tldextract).
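
One caveat about `netloc` as used above: it also carries any port or credentials present in the URL, whereas `hostname` returns only the lowercased host. A minimal sketch; the URL with credentials and a port is made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials and a port, for illustration only.
parsed = urlparse('http://user:secret@www.example.test:8080/foo/bar')

netloc = parsed.netloc      # includes credentials and port
hostname = parsed.hostname  # host only, lowercased

print(netloc)    # user:secret@www.example.test:8080
print(hostname)  # www.example.test
```

If the URLs you handle may contain a port, `hostname` is usually the safer starting point before any further splitting.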

score 75 · Answer 2 · edited Jan 12 '22 at 09:56

Instead of regex or hand-written solutions, you can use Python's urlparse:

from urllib.parse import urlparse

print(urlparse('http://abc.hostname.com/somethings/anything/'))
>> ParseResult(scheme='http', netloc='abc.hostname.com', path='/somethings/anything/', params='', query='', fragment='')

print(urlparse('http://abc.hostname.com/somethings/anything/').netloc)
>> abc.hostname.com

To get the domain without the subdomain (this simply keeps the last two labels of the host):

t = urlparse('http://abc.hostname.com/somethings/anything/').netloc
print('.'.join(t.split('.')[-2:]))
>> hostname.com

edited Jan 12 '22 at 09:56 by Herbert · answered May 22 '17 at 13:14 by philshem

In Python 3 the `urlparse` module was renamed to `urllib.parse`. – AIpeter Nov 21 '18 at 15:15
Will it work with something like test.mytest.example.com? – qasimzee May 26 '20 at 18:31
@qasimzee it won't, it's getting everything from the first `.` onward – gdvalderrama Dec 09 '21 at 10:52
It will fail with `*.co.uk` or `*.ac.uk` domains. – mommi84 Feb 10 '22 at 16:25
@mommi84 You'll need to prepend `http://` – philshem Feb 11 '22 at 06:34
`t.split('.')[-2:]` literally keeps only the last two substrings, so I am afraid it will just return `co.uk` and `ac.uk`, whether you prepend that or not. – mommi84 Feb 11 '22 at 10:15
This (wrong due to the mentioned reasons) answer has so many up-votes and then we wonder why different software and websites have so many bugs... – Nairum Jun 03 '22 at 14:34
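
To make the objection in the comments concrete, here is a rough sketch of why keeping the last two labels breaks on multi-label suffixes, and how a suffix-aware lookup behaves instead. The `KNOWN_SUFFIXES` set and the `registered_domain` helper are hypothetical names with a tiny hand-picked list; they are no substitute for the full Public Suffix List that a library such as tldextract relies on.

```python
from urllib.parse import urlparse

# Tiny illustrative sample only; the real Public Suffix List has thousands of entries.
KNOWN_SUFFIXES = {'com', 'co.uk', 'ac.uk'}


def last_two_labels(url):
    """Naive approach from the answer: keep the last two dot-separated labels."""
    host = urlparse(url).hostname or ''
    return '.'.join(host.split('.')[-2:])


def registered_domain(url):
    """Hypothetical helper: longest known suffix plus one label to its left."""
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    # Walk from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        if '.'.join(labels[i:]) in KNOWN_SUFFIXES:
            return '.'.join(labels[i - 1:])
    return host  # suffix not in the sample list: fall back to the full host


print(last_two_labels('http://www.example.co.uk/page'))    # co.uk
print(registered_domain('http://www.example.co.uk/page'))  # example.co.uk
print(registered_domain('http://abc.hostname.com/somethings/anything/'))  # hostname.com
```

The lookup is only as good as its suffix list, which is why a maintained library is the more robust choice in practice.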

score 31 · Answer 3 · edited Jun 30 '21 at 21:17

You can use tldextract.

Example code:

from tldextract import extract

# extract() splits the host into subdomain, domain and suffix
result = extract("http://abc.hostname.com/somethings/anything/")  # abc, hostname, com
url = result.domain + '.' + result.suffix
print(url)  # prints hostname.com

edited Jun 30 '21 at 21:17 by ifly6 · answered May 22 '17 at 13:41 by Deivanai Subramanian

`tldextract` is not a standard lib (at least not in Python 2.7); I think you should mention that. Still +1 – t.m.adam May 22 '17 at 17:57
Works well! But I'm getting `No handlers could be found for logger "tldextract"`. How do I handle this? – D09r Jun 21 '18 at 14:01
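
On the logger message in the last comment: Python 2 prints that warning when a library logs something and no handler has been configured anywhere. A minimal sketch of one common way to deal with it, assuming you simply want those messages discarded (calling `logging.basicConfig()` instead would show them):

```python
import logging

# Attach a do-nothing handler so records from the "tldextract" logger
# are dropped instead of triggering the "No handlers could be found" warning.
tld_logger = logging.getLogger("tldextract")
tld_logger.addHandler(logging.NullHandler())
```

Run this once, before the first call into the library.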

score 4 · Answer 4 · answered May 22 '17 at 12:58

Assuming you have the URL in a string (my_string below) and want something generic that copes with several subdomain levels, you could do:

token=my_string.split('http://')[1].split('/')[0]
top_level=token.split('.')[-2]+'.'+token.split('.')[-1]

We first split on http:// to remove the scheme from the string. Then we split on / to drop any directory or sub-directory parts, and [-2] takes the second-to-last token after splitting on ., which we join with the last token to get the domain (hostname.com for the URL in the question).

There are probably more graceful and robust ways to do this; for example, if your website is http://.com it will break, but it's a start :)

Your code can be simplified to `token = my_string.split('/')[2]`, which also works for ftp:// and https:// URLs. – Gahan, May 22 '17 at 13:00
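
A rough sketch of the same idea without hard-coding `http://`: `urlparse` handles any scheme, and a URL with no scheme at all can be handled by prefixing `//` so that the host is still parsed as the network location. The `host_of` helper is a hypothetical name used only for illustration.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of a URL, with or without a scheme."""
    if '//' not in url:
        # Without '//' urlparse treats the whole string as a path.
        url = '//' + url
    return urlparse(url).hostname


print(host_of('https://abc.hostname.com/somethings/anything/'))  # abc.hostname.com
print(host_of('ftp://abc.hostname.com/file.txt'))                # abc.hostname.com
print(host_of('abc.hostname.com/somethings/anything/'))          # abc.hostname.com
```

This only isolates the host; stripping subdomains still runs into the multi-label suffix problem raised in the comments on the other answers.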

score -5 · Answer 5 · edited Jan 12 '22 at 09:58

Try:

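try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

parsed = urlparse('http://abc.hostname.com/somethings/anything/')
domain = parsed.netloc.split(".")[-2:]  # keep only the last two labels
host = ".".join(domain)
print(host)  # prints hostname.com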

edited Jan 12 '22 at 09:58 by Herbert · answered May 22 '17 at 13:17 by Sathish Kumar VG

Won't work with .co.uk – Quentin Feb 10 '21 at 16:48
