Get protocol and domain (WITHOUT subdomain) from a URL

Question

This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.

So, for example,

Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu

Input: mail.google.com
Output: google.com

Input: google.co.uk
Output: google.co.uk

For more context, I accept one or more seed URLs from a user and then run a Scrapy crawler on the links. I need the domain name (without the subdomain) to set the spider's allowed_domains attribute.

I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.

My current code uses urlparse but this also gets the subdomain which I don't want...

from urllib.parse import urlparse

uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'

Is there a (hopefully stdlib) way of getting only the domain in Python 3.x?

score 4 · Accepted Answer · answered Apr 20 '19 at 01:10

I use tldextract when I need to parse domains.

In your case, you only need to combine the domain and the suffix:

import tldextract

tldextract.extract('mail.google.com')
# ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
# ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
# ExtractResult(subdomain='', domain='google', suffix='co.uk')

# Join the two parts to drop the subdomain
ext = tldextract.extract('mail.google.com')
f'{ext.domain}.{ext.suffix}'
# 'google.com'

answered Apr 20 '19 at 01:10

BENY

This was also the accepted answer in the other question I linked, although I was hoping to see if there was an stdlib way of doing so. This works, but I'll wait a bit and see if anything else comes in. – cs95 Apr 20 '19 at 03:31
@cs95 the thing about tldextract that you won't find in stdlib is that it is data-dependent (in a good way, but one that the stdlib cannot be), using the public suffix list rather than just naively splitting on `.`. I.e., there is no "logic": it fundamentally requires checking the PSL to get the suffix first – Brad Solomon Apr 20 '19 at 03:47
@BradSolomon oic, fair enough. Thanks for clearing that up. Stdlib is not a necessity, but you can understand the desire to minimise 3rd party dependencies... – cs95 Apr 20 '19 at 03:52
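
To make the Public Suffix List point from the comments concrete, here is a minimal stdlib-only sketch of the lookup. It assumes a tiny hand-picked suffix set (KNOWN_SUFFIXES), which is only a stand-in for the real list. The helper names registered_domain and scheme_and_domain and the default_scheme parameter are hypothetical; they are not part of tldextract or the standard library.

```python
from urllib.parse import urlparse

# ASSUMPTION: hand-picked stand-in for the Public Suffix List. The real
# list has thousands of entries and changes over time.
KNOWN_SUFFIXES = {"com", "edu", "uk", "co.uk"}


def registered_domain(host: str) -> str:
    """Return the longest known suffix plus the one label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            # i == 0 means the host is itself a bare suffix.
            return ".".join(labels[max(i - 1, 0):])
    return host  # no known suffix: leave the host untouched


def scheme_and_domain(url: str, default_scheme: str = "https") -> str:
    """Build 'scheme://domain/' with the subdomain removed."""
    # urlparse only fills netloc when the URL has a scheme or starts with '//'.
    parsed = urlparse(url if "://" in url else f"//{url}")
    scheme = parsed.scheme or default_scheme
    return f"{scheme}://{registered_domain(parsed.hostname or '')}/"
```

With these suffixes, scheme_and_domain('https://classes.usc.edu/term-20191/classes/csci/') gives 'https://usc.edu/'. Any host whose suffix is missing from the set is returned unchanged, which is why a maintained list (as tldextract uses) is needed in practice.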
