68

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
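For example, on the Australian URL the expression keeps only the last two labels, which here is just the public suffix (Python 2 urlparse, as in the snippet above):

from urlparse import urlparse

url = 'http://www.foo.com.au'
print('.'.join(urlparse(url).netloc.split('.')[-2:]))
# prints 'com.au', not 'foo.com.au'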

Thanks.

hoju

8 Answers

64

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
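For instance, a minimal usage sketch (the extract call and its subdomain/domain/suffix fields are as documented in the project's README; the output comments assume the Australian example from the question):

import tldextract

ext = tldextract.extract('http://www.foo.com.au')
# ext.subdomain == 'www', ext.domain == 'foo', ext.suffix == 'com.au'
print('.'.join([ext.domain, ext.suffix]))  # foo.com.au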

Acorn
54

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, only domains like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
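As a toy illustration of that auxiliary-table idea (the two-entry table below is a deliberately incomplete stand-in, not the real list):

# Toy auxiliary table; a real one would come from a maintained source
# such as the Public Suffix List.
COMPOUND_SUFFIXES = {'co.uk', 'com.au'}  # hypothetical, far from complete

def registrable_domain(host):
    labels = host.split('.')
    # a known compound suffix eats two labels, so keep three
    if '.'.join(labels[-2:]) in COMPOUND_SUFFIXES:
        return '.'.join(labels[-3:])
    return '.'.join(labels[-2:])

print(registrable_domain('zap.co.uk'))  # zap.co.uk (co.uk is a suffix)
print(registrable_domain('zap.co.it'))  # co.it (zap.co.it is a subdomain of co.it)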

Alex Martelli
42

Using this file of effective TLDs, which someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds: 
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:]) 
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

results in:

abcde.co.uk

I'd appreciate it if someone could let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?

Markus
  • If you need to call get_domain() often in practice, such as extracting domains from a large log file, I would recommend that you make tlds a set, e.g. tlds = set([line.strip() for line in tld_file if line[0] not in "/\n"]). This gives you constant-time lookup for each of those checks for whether some item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list) and, for my entire operation extracting domains from a ~20 million line log file, about a 60 times speedup (6 minutes, down from 6 hours). – Bryce Thomas Aug 07 '10 at 04:14
  • This is awesome! Just one more question: is that `effective_tld_names.dat` file also updated for new domains such as `.amsterdam`, `.vodka` and `.wtf`? – kramer65 Aug 04 '15 at 13:50
  • The Mozilla public suffix list gets regular maintenance, yes, and now has multiple Python libraries which include it. See http://publicsuffix.org/ and the other answers on this page. – tripleee Mar 29 '17 at 10:35
  • Some updates to get this right in 2021: the file is now called `public_suffix_list.dat`, and Python will complain if you don't specify that it should read the file as UTF8. Specify the encoding explicitly: `with open("public_suffix_list.dat", encoding="utf8") as tld_file` – Andrei Nov 06 '21 at 09:30
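Folding the last two comments above together, a sketch of the loading step updated for Python 3 (renamed data file, explicit encoding, set instead of list for constant-time membership tests):

# Python 3 variant per the comments above
with open("public_suffix_list.dat", encoding="utf8") as tld_file:
    tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}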
41

Using the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print get_tld("http://www.google.co.uk") 

co.uk

or without a protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'
Artur Barseghyan
  • This will become more unreliable with the new gTLDs. – Sjaak Trekhaak Jun 26 '14 at 13:40
  • Hey, thanks for pointing at this. I guess, when it comes to the point that new gTLDs are actually being used, a proper fix could come into the ``tld`` package. – Artur Barseghyan Jun 26 '14 at 14:12
  • Thank you @ArturBarseghyan! It's very easy to use with Python. But I am now using it for an enterprise-grade product; is it a good idea to continue using it even if gTLDs are not being supported? If yes, when do you think gTLDs will be supported? Thank you again. – Akshay Patil Dec 11 '14 at 11:21
  • @Akshay Patil: As stated above, when it comes to the point that gTLDs are intensively used, a proper fix (if possible) would arrive in the package. In the meantime, if you're much concerned about gTLDs, you can always catch the ``tld.exceptions.TldDomainNotFound`` exception and proceed anyway with whatever you were doing, even if the domain hasn't been found. – Artur Barseghyan Dec 11 '14 at 13:04
  • Is it just me, or does `tld.get_tld()` actually return a fully qualified domain name, not a top level domain? – Marian May 12 '15 at 15:49
  • `get_tld("http://www.google.co.uk", as_object=True).extension` would print out: "co.uk" – Artur Barseghyan May 12 '15 at 19:01
  • Having URL parsing functionality built in is nice, I suppose, but *requiring* input to be a URL seems misdirected. If I want to handle host names for SSH or whatever, forcing them to be URLs (or "accepting" that the protocol is "missing") is just weird. – tripleee Mar 29 '17 at 08:21
  • @tripleee: It works without a protocol as well; see the updated example. – Artur Barseghyan Mar 29 '17 at 21:31
2

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/
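If single-label TLDs are all you need, here is a sketch of loading a locally downloaded copy of that IANA file (its first line is a "# Version ..." comment and the entries are upper-case). Note that it contains no compound suffixes like co.uk, so on its own it cannot answer the original question:

# assumes tlds-alpha-by-domain.txt has been downloaded locally
with open('tlds-alpha-by-domain.txt') as f:
    iana_tlds = {line.strip().lower() for line in f if not line.startswith('#')}

print('au' in iana_tlds)     # True
print('co.uk' in iana_tlds)  # False: no compound suffixes in this list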

S.Lott
0

Until get_tld is updated for all the new TLDs, I pull the TLD from the error message. Sure, it's bad code, but it works.

import re
from tld import get_tld

def get_domain_or_fallback(url):
  try:
    return get_tld(url)
  except Exception, e:
    # the tld library names the offending domain in its error message,
    # so scrape it back out with a regex
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    match_obj = re_domain.findall(str(e))
    if match_obj:
      return match_obj[0]
    raise
Russ Savage
-1

Here's how I handle it:

import re
import sys
import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]          # the netloc
domain = '.'.join(website.split('.')[-2:])   # keep only the last two labels
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)
Ryan Buckley
-1

In Python, I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!

So finally, I decided to write this method:

IMPORTANT NOTE: this only works with URLs that have a single-label subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise ValueError("Full URL required, with subdomain: %s" % url)
    return {
        'subdomain': url_split[0],          # assumes a single-label subdomain
        'domain': url_split[1],
        'suffix': ".".join(url_split[2:]),  # everything after the domain label
    }
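
For example, on the URL from the anecdote above:

print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}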
Korayem