This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.
So, for example,
Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu
Input: mail.google.com
Output: google.com
Input: google.co.uk
Output: google.co.uk
For more context, I accept one or more seed URLs from a user and then run a Scrapy crawler on the links. I need the domain name (without the subdomain) to set the spider's allowed_domains attribute.
I've also taken a look at "Python urlparse -- extract domain name without subdomain", but the answers there seem outdated.
My current code uses urlparse, but this also returns the subdomain, which I don't want:
from urllib.parse import urlparse
uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'
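To illustrate why this is not trivial with the stdlib alone, here is a minimal sketch of the obvious workaround: keep only the last two dot-separated labels of the host. The helper name `naive_domain` is hypothetical, and the "last two labels" rule is an assumption that happens to hold for the first two examples above but not for the third.

```python
from urllib.parse import urlparse

def naive_domain(url):
    # urlparse only fills in the host when the URL has a scheme or starts with '//'
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    # Naive assumption: the registered domain is always the last two labels
    return '.'.join(host.split('.')[-2:])

print(naive_domain('classes.usc.edu/xxx/yy/zz'))  # usc.edu
print(naive_domain('mail.google.com'))            # google.com
print(naive_domain('google.co.uk'))               # co.uk (wanted: google.co.uk)
```

Multi-part suffixes such as co.uk break the rule, so a plain split on dots is not enough.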
Is there a (preferably stdlib) way of getting only the domain in Python 3.x?