Don't use a regex if you can avoid it; see if you can parse the URL with a dedicated library instead.
This will also help with other TLDs, such as .net, .org, and .club.
>>> import urllib.parse
>>> urls = ("https://www.example.com/directory", "www.example.com/directory", "example.com/directory")
>>> for url in urls:
...     print(urllib.parse.urlparse("http://" + url.split("//")[-1]))
...
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='example.com', path='/directory', params='', query='', fragment='')
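The reason for stripping any existing scheme and prepending "http://" is that urlparse only recognises a host when the string has a scheme or starts with "//". A small sketch of that behaviour (the variable names bare and prefixed are just for illustration):

```python
from urllib.parse import urlparse

# Without a scheme (or a leading "//"), the host ends up in path, not netloc.
bare = urlparse("www.example.com/directory")
print(repr(bare.netloc))  # ''
print(bare.path)          # www.example.com/directory

# A leading "//" is enough for urlparse to recognise the host.
prefixed = urlparse("//www.example.com/directory")
print(prefixed.netloc)    # www.example.com
print(prefixed.path)      # /directory
```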
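To get just the top- and second-level domain, you could simply split() the netloc on dots and keep the last two parts (this assumes a single-label TLD, so it gives the wrong result for suffixes like .co.uk):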
>>> urllib.parse.urlparse("http://whatever.example.com").netloc.split(".")[-2:]
['example', 'com']
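If you need this in more than one place, the two steps could be wrapped in a small helper. This is only a sketch: base_domain is a hypothetical name, and it uses hostname rather than netloc so that a port or user info does not end up in the result. It keeps the same single-label TLD assumption as the split above.

```python
from urllib.parse import urlparse


def base_domain(url):
    """Return the last two labels of the host, e.g. 'example.com'.

    Hypothetical helper; assumes a single-label TLD such as .com or .net.
    """
    # Drop any existing scheme and prepend one so urlparse fills in the host.
    parsed = urlparse("http://" + url.split("//")[-1])
    # hostname is lowercased and excludes any port or user info.
    host = parsed.hostname or ""
    return ".".join(host.split(".")[-2:])
```

For example, base_domain("https://www.example.com/directory") and base_domain("example.com/directory") both return 'example.com'. Handling multi-part suffixes correctly would need a lookup against the Public Suffix List rather than a fixed count of labels.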