Regular expression for domain and extension can't handle double word extensions

Question

I am struggling to get this regular expression to work on non-simple domains.

((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(?P<extension>\w+)(\/.*)?

It works on:

http://google.com
https://google.com
http://www.google.com
https://www.google.com

So in the above examples, it recognises the domain as google and the extension as .com.

But if it is a double word extension, it falls over:

http://www.google.com.hk

In the above example the domain is seen as .com and the extension as .hk.

Do you know how I can tweak the regex to understand .com.hk style extensions?

Thank you.

Refer to this link; hope it works. [Click here](http://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url) — Karthick Kumar Ramakrishnan, Oct 06 '16 at 11:33
The link from @KarthickKumarRamakrishnan works; I think it is a good solution: http://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url — antoni, Oct 06 '16 at 11:38

score 1 · Accepted Answer · answered Oct 06 '16 at 11:55

Allow an optional "dot-then-word" to be part of the extension:

((https?):\/\/)?(\w+)\.(?P<domain>\w+)\.(?P<extension>\w+(\.\w+)?)(\/.*)?

I also removed the * quantifier from the capture of the first part of the url.
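
As a quick check, here is a minimal Python sketch that runs the pattern above against a few URLs. The pattern is copied unchanged from this answer; the helper name `split_host` is only an illustrative assumption.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(
    r"((https?):\/\/)?(\w+)\.(?P<domain>\w+)\.(?P<extension>\w+(\.\w+)?)(\/.*)?"
)


def split_host(url):
    """Return (domain, extension), or None if the pattern does not match.

    split_host is a hypothetical helper name used for illustration only.
    """
    m = PATTERN.match(url)
    if m is None:
        return None
    return m.group("domain"), m.group("extension")


if __name__ == "__main__":
    for url in (
        "http://www.google.com",
        "http://www.google.com.hk",
        "http://google.com",
    ):
        print(url, "->", split_host(url))
```

Note that the pattern expects exactly one label (such as www) before the domain. In this sketch a URL without one, such as http://google.com, is not matched at all, so it may be worth testing the pattern against the bare-domain forms you need to support.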

Bohemian

score 0 · Answer 2 · answered Oct 06 '16 at 11:46

Try this - .*\:\/\/(?:www\.)?([^\/ ]+)

That captures the host name after the scheme and an optional www., so google.com.hk is captured in one piece. It stops at a / or a space.
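
A small Python sketch of how this pattern behaves, assuming it is applied with `re.match`; the helper name `host_without_www` is made up for the example.

```python
import re

# Pattern copied verbatim from the answer above.
HOST = re.compile(r".*\:\/\/(?:www\.)?([^\/ ]+)")


def host_without_www(url):
    """Return the first capture group, or None if there is no match.

    host_without_www is a hypothetical helper name used for illustration only.
    """
    m = HOST.match(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    for url in ("http://www.google.com.hk/search", "https://google.com", "google.com"):
        print(url, "->", host_without_www(url))
```

The capture group holds everything between the scheme (minus a leading www.) and the first / or space, so it does not separate the domain from the extension by itself, and a URL without a scheme is not matched.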

sideroxylon

Casimir et Hippolyte · Answer 3 · 2016-10-06T19:47:43.567

There's no notion of "extension" in domain names. There are only the FQDN (Fully Qualified Domain Name), the TLD (Top-Level Domain), labels and sub-domains.

If I take your last example http://www.google.com.hk:

www, google, com, hk are labels
www.google.com.hk is a domain and the FQDN
hk is a domain and since it's the last, it's the TLD
com.hk is a hk sub-domain
google.com.hk is a com.hk sub-domain
www.google.com.hk is a google.com.hk sub-domain
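
The breakdown above can be reproduced with a short Python sketch. The function name `domain_chain` is an assumption made for this illustration.

```python
def domain_chain(fqdn):
    """List every domain contained in fqdn, from the TLD up to the full name.

    domain_chain is a hypothetical helper name used for illustration only.
    """
    labels = fqdn.split(".")
    # Start with the last label alone (the TLD), then prepend one label at a time.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


if __name__ == "__main__":
    print("www.google.com.hk".split("."))
    for domain in domain_chain("www.google.com.hk"):
        print(domain)
```

Each entry in the result is a sub-domain of the entry before it, which is the same chain as in the list above.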

The important thing is that there is nothing special about the label com here; it could be anything. It doesn't have the constraints of a TLD (TLDs can't be anything, you can find a list here).

Conclusion: com.hk is no different from google.hk, google.com, pizza.org or org.pizza (yes, the TLD pizza exists). Each has two labels separated by a dot: a TLD and its sub-domain.

Note: sometimes the last two labels are called the SLD (Second-Level Domain).

Whatever language you use, regex is not the way to go if you want to parse a URL, for two main reasons:

the URL syntax is more complicated than you think
most languages already have a tool to do it (urllib.parse in Python, parse_url in PHP, the URI class in C#, java.net.URL in Java, the url module in nodejs...)

Using these tools, you can easily extract the hostname from a URL.

After that, you first need to check that this hostname isn't an IPv4 address, because in that case the dots don't have the same meaning (they aren't there to split an FQDN into labels but to separate the four numbers), or an IPv6 address.

Then you only need to split the hostname on the dots and take the last item to obtain the TLD. You can join the remaining items back together to get the "sub-domain part" of the hostname.
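
The three steps above (extract the hostname, rule out IP addresses, split off the last label) might look like this in Python, using the standard `urllib.parse` and `ipaddress` modules. This is a minimal sketch; the function name `split_tld` and its return shape are assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit


def split_tld(url):
    """Return (sub_domain_part, tld), or None when there is no name to split.

    split_tld is a hypothetical helper name used for illustration only.
    """
    # urlsplit needs a scheme (or a leading //) to recognise the host.
    host = urlsplit(url).hostname
    if not host:
        return None
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IPv4/IPv6 literal, so the dots separate labels
    else:
        return None  # IP address: there are no labels to split
    labels = host.rstrip(".").split(".")
    return ".".join(labels[:-1]), labels[-1]


if __name__ == "__main__":
    for url in (
        "http://www.google.com.hk/search?q=1",
        "https://google.com",
        "http://192.168.0.1/",
        "http://[::1]:8080/",
    ):
        print(url, "->", split_tld(url))
```

With this approach http://www.google.com.hk yields hk as the TLD and www.google.com as the remaining part, which matches the terminology used in this answer rather than the "extension" the question asks for.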

If your goal is to separate all the labels at the end that are in the TLD list, you have to include this list in your code in whatever form suits you, and check, starting from the end, whether each item is in it.
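
One possible reading of that last step, sketched in Python. The set `SAMPLE_TLDS` is a tiny hand-picked sample and the name `split_trailing_tlds` is made up for this example; a real program would load a complete TLD list instead.

```python
# Assumption: a small sample for illustration, not a complete TLD list.
SAMPLE_TLDS = {"com", "org", "hk", "pizza"}


def split_trailing_tlds(hostname, tlds=SAMPLE_TLDS):
    """Split hostname into (leading part, trailing labels found in tlds).

    split_trailing_tlds is a hypothetical helper name used for illustration only.
    """
    labels = hostname.lower().split(".")
    cut = len(labels)
    # Walk backwards from the end while the label is in the list.
    while cut > 0 and labels[cut - 1] in tlds:
        cut -= 1
    return ".".join(labels[:cut]), ".".join(labels[cut:])


if __name__ == "__main__":
    for host in ("www.google.com.hk", "www.google.com", "org.pizza"):
        print(host, "->", split_trailing_tlds(host))
```

Because every label is checked only against the list, a name made up entirely of listed labels (org.pizza in this sample) ends up with an empty leading part, so the caller has to decide how to treat that case.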

But once again, there is no "extension" in a domain name, and still less a "double word extension".
