There's no notion of "extension" in domain names; there are only the FQDN (Fully Qualified Domain Name), the TLD (Top-Level Domain), labels and sub-domains.
If I take your last example, http://www.google.com.hk:
- www, google, com, hk are labels
- www.google.com.hk is a domain and the FQDN
- hk is a domain and, since it's the last, it's the TLD
- com.hk is a hk sub-domain
- google.com.hk is a com.hk sub-domain
- www.google.com.hk is a google.com.hk sub-domain
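To make the hierarchy above concrete, here is a minimal sketch that lists every domain contained in an FQDN, from the TLD up to the full name. The function name domain_chain is hypothetical and only used for illustration.

```python
from typing import List


def domain_chain(fqdn: str) -> List[str]:
    """Return every domain contained in fqdn, from the TLD up to the FQDN."""
    # A trailing dot (the DNS root) is dropped before splitting into labels.
    labels = fqdn.rstrip(".").split(".")
    # Each suffix of the label list is a domain; the shortest one is the TLD.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


chain = domain_chain("www.google.com.hk")
print(chain)
```

Each entry in the result is a sub-domain of the entry before it, which is exactly the relationship described in the list.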
The important thing is that the label com has nothing special about it here and could be anything. It doesn't have the constraints of a TLD (a TLD can't be arbitrary, you can find a list here).
Conclusion: com.hk isn't different from google.hk, google.com, pizza.org or org.pizza (yes, the TLD pizza exists). All have two labels separated by a dot: a TLD and its sub-domain.
Note: the last two labels taken together are sometimes called the SLD (Second-Level Domain).
Whatever language you use, regex is not the way to go if you want to parse a URL, for two main reasons:
- the URL syntax is more complicated than you think
- most languages already have a tool to do it (urllib.parse in Python, parse_url in PHP, the Uri class in C#, java.net.URL in Java, the url module in Node.js...)
Using these tools, you can easily extract the hostname from a URL.
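As an illustration in Python, a minimal sketch using urllib.parse might look like this. The helper name extract_hostname is hypothetical; urlsplit and its hostname attribute are part of the standard library.

```python
from typing import Optional
from urllib.parse import urlsplit


def extract_hostname(url: str) -> Optional[str]:
    """Return the lowercased hostname of url, or None if it has none."""
    # .hostname strips the port, any user:password part and IPv6 brackets.
    return urlsplit(url).hostname


print(extract_hostname("http://www.google.com.hk"))
print(extract_hostname("https://user:pw@Example.ORG:8443/path?q=1"))
```

Note that a string without a scheme or a leading // (for example a bare www.google.com.hk) is treated as a path, so the hostname comes back as None.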
Next, you need to check that this hostname isn't an IPv4 address, because in that case the dots don't have the same meaning (they aren't there to split an FQDN into labels but to separate the four numbers), or an IPv6 address.
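In Python, one way to do this check is with the standard ipaddress module, which covers both IPv4 and IPv6. This is only a sketch; the helper name is_ip_address is hypothetical.

```python
import ipaddress


def is_ip_address(hostname: str) -> bool:
    """Return True if hostname is an IPv4 or IPv6 literal."""
    try:
        # ip_address() raises ValueError for anything that is not an IP literal.
        ipaddress.ip_address(hostname)
    except ValueError:
        return False
    return True


print(is_ip_address("192.168.0.1"))
print(is_ip_address("www.google.com.hk"))
```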
Then you only need to split the hostname on dots and take the last item to obtain the TLD. You can join the remaining items back together to get the "sub-domain part" of the hostname.
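A minimal sketch of that split in Python, assuming the hostname has already been checked not to be an IP address (split_tld is a hypothetical name):

```python
from typing import Tuple


def split_tld(hostname: str) -> Tuple[str, str]:
    """Return (sub-domain part, TLD) for a hostname that is not an IP address."""
    # Drop a trailing root dot, then split the name into its labels.
    labels = hostname.rstrip(".").split(".")
    # The last label is the TLD; everything before it is joined back together.
    return ".".join(labels[:-1]), labels[-1]


print(split_tld("www.google.com.hk"))
```

For a single-label hostname such as localhost, the first element of the result is an empty string.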
If your goal is to separate all the trailing labels that appear in the TLD list, you have to include this list in your code, in whatever form you like, and check the labels one by one, starting from the end, against it.
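As a sketch of that idea in Python: the set KNOWN_TLDS below is a tiny hypothetical sample standing in for the real TLD list, and the function name is also made up for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical sample; in real code, load the complete TLD list instead.
KNOWN_TLDS = {"com", "org", "hk", "pizza"}


def split_trailing_tld_labels(
    hostname: str, known_tlds: Iterable[str] = KNOWN_TLDS
) -> Tuple[str, str]:
    """Split hostname into (remaining labels, trailing labels found in known_tlds)."""
    tlds = set(known_tlds)
    labels = hostname.lower().rstrip(".").split(".")
    cut = len(labels)
    # Walk backwards from the last label while labels are in the TLD list.
    while cut > 0 and labels[cut - 1] in tlds:
        cut -= 1
    return ".".join(labels[:cut]), ".".join(labels[cut:])


print(split_trailing_tld_labels("www.google.com.hk"))
```

With this sample list, a name like org.pizza is consumed entirely, which shows that the result depends only on which labels happen to be in the list and not on any special status of com in com.hk.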
But once again, there is no "extension" in a domain name, let alone a "double-word extension".