68

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
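For example, on the Australian URL the expression keeps only the last two labels, which here is just the public suffix (Python 2 urlparse, as in the snippet above):

from urlparse import urlparse

url = 'http://www.foo.com.au'
print('.'.join(urlparse(url).netloc.split('.')[-2:]))
# prints 'com.au', not 'foo.com.au'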

Thanks.

hoju

8 Answers

64

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
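For instance, a minimal usage sketch (the extract call and its subdomain/domain/suffix fields are as documented in the project's README; the output comments assume the Australian example from the question):

import tldextract

ext = tldextract.extract('http://www.foo.com.au')
# ext.subdomain == 'www', ext.domain == 'foo', ext.suffix == 'com.au'
print('.'.join([ext.domain, ext.suffix]))  # foo.com.au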

Acorn
54

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, only domains like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
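As a toy illustration of that auxiliary-table idea (the two-entry table below is a deliberately incomplete stand-in, not the real list):

# Toy auxiliary table; a real one would come from a maintained source
# such as the Public Suffix List.
COMPOUND_SUFFIXES = {'co.uk', 'com.au'}  # hypothetical, far from complete

def registrable_domain(host):
    labels = host.split('.')
    # a known compound suffix eats two labels, so keep three
    if '.'.join(labels[-2:]) in COMPOUND_SUFFIXES:
        return '.'.join(labels[-3:])
    return '.'.join(labels[-2:])

print(registrable_domain('zap.co.uk'))  # zap.co.uk (co.uk is a suffix)
print(registrable_domain('zap.co.it'))  # co.it (zap.co.it is a subdomain of co.it)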

Alex Martelli
42

Using this file of effective TLDs, which someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds: 
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:]) 
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

results in:

abcde.co.uk

I'd appreciate it if someone could let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?

Markus
  • If you need to call get_domain() often in practice, such as extracting domains from a large log file, I would recommend that you make tlds a set, e.g. tlds = set([line.strip() for line in tld_file if line[0] not in "/\n"]). This gives you constant-time lookup for each of those checks for whether some item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list) and, for my entire operation extracting domains from a ~20 million line log file, about a 60 times speedup (6 minutes, down from 6 hours). – Bryce Thomas Aug 07 '10 at 04:14
  • This is awesome! Just one more question: is that `effective_tld_names.dat` file also updated for new domains such as `.amsterdam`, `.vodka` and `.wtf`? – kramer65 Aug 04 '15 at 13:50
  • The Mozilla public suffix list gets regular maintenance, yes, and now has multiple Python libraries which include it. See http://publicsuffix.org/ and the other answers on this page. – tripleee Mar 29 '17 at 10:35
  • Some updates to get this right in 2021: the file is now called `public_suffix_list.dat`, and Python will complain if you don't specify that it should read the file as UTF8. Specify the encoding explicitly: `with open("public_suffix_list.dat", encoding="utf8") as tld_file` – Andrei Nov 06 '21 at 09:30
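Folding the last two comments above together, a sketch of the loading step updated for Python 3 (renamed data file, explicit encoding, set instead of list for constant-time membership tests):

# Python 3 variant per the comments above
with open("public_suffix_list.dat", encoding="utf8") as tld_file:
    tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}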
41

Using the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print get_tld("http://www.google.co.uk") 

co.uk

or without a protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'
Artur Barseghyan
  • This will become more unreliable with the new gTLDs. – Sjaak Trekhaak Jun 26 '14 at 13:40
  • Hey, thanks for pointing at this. I guess, when it comes to the point that new gTLDs are actually being used, a proper fix could come into the ``tld`` package. – Artur Barseghyan Jun 26 '14 at 14:12
  • Thank you @ArturBarseghyan! It's very easy to use with Python. But I am now using it for an enterprise-grade product; is it a good idea to continue using it even if gTLDs are not being supported? If yes, when do you think gTLDs will be supported? Thank you again. – Akshay Patil Dec 11 '14 at 11:21
  • @Akshay Patil: As stated above, when it comes to the point that gTLDs are intensively used, a proper fix (if possible) would arrive in the package. In the meantime, if you're much concerned about gTLDs, you can always catch the ``tld.exceptions.TldDomainNotFound`` exception and proceed anyway with whatever you were doing, even if the domain hasn't been found. – Artur Barseghyan Dec 11 '14 at 13:04
  • Is it just me, or does `tld.get_tld()` actually return a fully qualified domain name, not a top level domain? – Marian May 12 '15 at 15:49
  • `get_tld("http://www.google.co.uk", as_object=True).extension` would print out: "co.uk" – Artur Barseghyan May 12 '15 at 19:01
  • Having URL parsing functionality built in is nice, I suppose, but *requiring* input to be a URL seems misdirected. If I want to handle host names for SSH or whatever, forcing them to be URLs (or "accepting" that the protocol is "missing") is just weird. – tripleee Mar 29 '17 at 08:21
  • @tripleee: It works without a protocol as well; see the updated example. – Artur Barseghyan Mar 29 '17 at 21:31
2

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/
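If single-label TLDs are all you need, here is a sketch of loading a locally downloaded copy of that IANA file (its first line is a "# Version ..." comment and the entries are upper-case). Note that it contains no compound suffixes like co.uk, so on its own it cannot answer the original question:

# assumes tlds-alpha-by-domain.txt has been downloaded locally
with open('tlds-alpha-by-domain.txt') as f:
    iana_tlds = {line.strip().lower() for line in f if not line.startswith('#')}

print('au' in iana_tlds)     # True
print('co.uk' in iana_tlds)  # False: no compound suffixes in this list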

S.Lott
0

Until get_tld is updated for all the new TLDs, I pull the TLD from the error message. Sure, it's bad code, but it works.

import re
from tld import get_tld

def get_domain_or_fallback(url):
  try:
    return get_tld(url)
  except Exception, e:
    # the tld library names the offending domain in its error message,
    # so scrape it back out with a regex
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    match_obj = re_domain.findall(str(e))
    if match_obj:
      return match_obj[0]
    raise
Russ Savage
-1

Here's how I handle it:

import re
import sys
import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]          # the netloc
domain = '.'.join(website.split('.')[-2:])   # keep only the last two labels
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)
Ryan Buckley
-1

In Python, I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!

So finally, I decided to write this method:

IMPORTANT NOTE: this only works with URLs that have a single-label subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise ValueError("Full URL required, with subdomain: %s" % url)
    return {
        'subdomain': url_split[0],          # assumes a single-label subdomain
        'domain': url_split[1],
        'suffix': ".".join(url_split[2:]),  # everything after the domain label
    }
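
For example, on the URL from the anecdote above:

print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}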
Korayem