I am struggling to get this regular expression to work on non-simple domains.

((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(?P<extension>\w+)(\/.*)?

It works on:

http://google.com
https://google.com
http://www.google.com
https://www.google.com

So in the above examples, it recognises the domain as google and the extension as .com.

But if it is a double word extension, it falls over:

http://www.google.com.hk

In the above example the domain is seen as .com and the extension as .hk.

Do you know how I can tweak the regex to understand .com.hk style extensions?

Thank you.

Tom Brock

3 Answers

Allow an optional "dot-then-word" to be part of the extension:

((https?):\/\/)?(\w+)\.(?P<domain>\w+)\.(?P<extension>\w+(\.\w+)?)(\/.*)?

I also removed the * quantifier from the capture of the first part of the URL.
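
As a quick check, here is a minimal Python sketch that applies the pattern above, unchanged, to a few of the URLs from the question. The helper name `split_url` is made up for illustration. Note that because the leading label is no longer optional, a bare `http://google.com` does not match this version.

```python
import re

# The pattern from this answer, compiled as-is.
PATTERN = re.compile(
    r"((https?):\/\/)?(\w+)\.(?P<domain>\w+)\.(?P<extension>\w+(\.\w+)?)(\/.*)?"
)

def split_url(url):
    """Return (domain, extension), or None when the pattern does not match.

    split_url is a hypothetical helper name used only for this sketch.
    """
    m = PATTERN.match(url)
    return (m.group("domain"), m.group("extension")) if m else None

print(split_url("http://www.google.com"))     # ('google', 'com')
print(split_url("http://www.google.com.hk"))  # ('google', 'com.hk')
print(split_url("http://google.com"))         # None: a leading label is required
```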

Bohemian

Try this - .*\:\/\/(?:www\.)?([^\/ ]+)

That captures everything after the scheme and an optional www. in group 1, so for the example it captures google.com.hk. It stops at a / or space.
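
A small Python sketch of how this pattern behaves, assuming the URL always carries a scheme (the pattern needs the `://`). The helper name `capture_host` is hypothetical.

```python
import re

# The pattern from this answer; group 1 holds the host without a leading "www.".
HOST = re.compile(r".*\:\/\/(?:www\.)?([^\/ ]+)")

def capture_host(url):
    """Return group 1 of the pattern, or None when there is no match."""
    m = HOST.match(url)
    return m.group(1) if m else None

print(capture_host("http://www.google.com.hk"))           # google.com.hk
print(capture_host("https://google.com/search?q=regex"))  # google.com
print(capture_host("www.google.com.hk"))                  # None: no scheme
```

It returns the host as a single string, so separating google from com.hk still has to happen afterwards.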

sideroxylon

There's no notion of "extension" in domain names. There are only the FQDN (Fully Qualified Domain Name), the TLD (Top-Level Domain), labels, and sub-domains.

If I take your last example http://www.google.com.hk:

  • www, google, com, hk are labels
  • www.google.com.hk is a domain and the FQDN
  • hk is a domain and since it's the last, it's the TLD
  • com.hk is a hk sub-domain
  • google.com.hk is a com.hk sub-domain
  • www.google.com.hk is a google.com.hk sub-domain
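
The breakdown above can be reproduced with plain string operations. This is only an illustrative sketch; the function name `domain_chain` is made up.

```python
def domain_chain(hostname):
    """Return (labels, domains): the labels of the hostname and every
    domain it belongs to, from the FQDN down to the TLD."""
    labels = hostname.split(".")
    # Each domain is the hostname with some leading labels dropped.
    domains = [".".join(labels[i:]) for i in range(len(labels))]
    return labels, domains

labels, domains = domain_chain("www.google.com.hk")
print(labels)       # ['www', 'google', 'com', 'hk']
print(domains)      # ['www.google.com.hk', 'google.com.hk', 'com.hk', 'hk']
print(domains[-1])  # hk, the TLD
```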

The important thing is that the label com is nothing special here and could be anything. It doesn't have the constraints of a TLD (a TLD can't be just anything; IANA publishes the list of valid TLDs).

Conclusion: com.hk is no different from google.hk, google.com, pizza.org, or org.pizza (yes, the TLD pizza exists). All have two labels separated by a dot: a TLD and its sub-domain.

Note: the last two labels together are sometimes called the SLD (Second-Level Domain).


Whatever language you use, regex is not the way to go if you want to parse a URL, for two main reasons:

  • the URL syntax is more complicated than you think
  • most languages already have a tool for it (urllib.parse in Python, parse_url in PHP, the Uri class in C#, java.net.URL in Java, the url module in Node.js...)

Using these tools, you can easily extract the hostname from a URL.
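
For example, in Python this could look like the sketch below, which uses `urllib.parse.urlsplit` from the standard library. The wrapper name `hostname_of` is hypothetical. One thing to watch for: without a scheme, the parser treats the whole string as a path and reports no hostname.

```python
from urllib.parse import urlsplit

def hostname_of(url):
    """Return the lower-cased hostname of a URL, or None if it has none."""
    return urlsplit(url).hostname

print(hostname_of("http://www.google.com.hk/search?q=x"))  # www.google.com.hk
print(hostname_of("https://Example.COM:8443/"))            # example.com
print(hostname_of("www.google.com.hk"))                    # None: no scheme
```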


Next, you need to check that this hostname isn't an IPv4 address, because in that case the dots don't have the same meaning (they aren't there to split an FQDN into labels but to separate the four numbers), or an IPv6 address.
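
One possible way to do this check in Python is the standard `ipaddress` module, sketched below. The function name `is_ip_address` is made up, and the sketch assumes the hostname has already been extracted without the square brackets that surround IPv6 addresses in URLs (which is how `urlsplit(...).hostname` returns it).

```python
import ipaddress

def is_ip_address(hostname):
    """Return True if hostname is an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(hostname)
    except ValueError:
        # Not parseable as an address, so treat it as a domain name.
        return False
    return True

print(is_ip_address("192.168.0.1"))        # True
print(is_ip_address("::1"))                # True
print(is_ip_address("www.google.com.hk"))  # False
```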

Then you only need to split the hostname on dots and take the last item to obtain the TLD. You can join the remaining items back together to get the "sub-domain part" of the hostname.
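
A minimal sketch of that split in Python, assuming the hostname is a domain name and not an IP address. The name `split_tld` is hypothetical.

```python
def split_tld(hostname):
    """Return (sub_domain_part, tld) for a dotted hostname."""
    labels = hostname.split(".")
    # The last label is the TLD; everything before it is joined back.
    return ".".join(labels[:-1]), labels[-1]

print(split_tld("www.google.com.hk"))  # ('www.google.com', 'hk')
print(split_tld("google.com"))         # ('google', 'com')
```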

If your goal is to separate out all the trailing labels that appear in the TLD list, you have to include that list in your code in whatever way you like and check, working from the end, whether each item is in it.
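
Here is a rough sketch of that idea. The set `KNOWN_TLDS` is a tiny hand-picked subset used purely for illustration, not the real list, and `split_trailing_tlds` is a made-up name.

```python
# Illustrative subset only; a real program would load the full TLD list.
KNOWN_TLDS = {"com", "org", "hk", "pizza"}

def split_trailing_tlds(hostname, tlds=KNOWN_TLDS):
    """Peel labels off the end of hostname for as long as they are in tlds.

    Returns (remaining_part, trailing_part); either may be empty.
    """
    labels = hostname.lower().split(".")
    cut = len(labels)
    while cut > 0 and labels[cut - 1] in tlds:
        cut -= 1
    return ".".join(labels[:cut]), ".".join(labels[cut:])

print(split_trailing_tlds("www.google.com.hk"))  # ('www.google', 'com.hk')
print(split_trailing_tlds("www.google.com"))     # ('www.google', 'com')
print(split_trailing_tlds("pizza.org"))          # ('', 'pizza.org')
```

The last call shows the catch: any label that happens to be a TLD gets peeled off. With the full list, which also contains google, the first example would lose that label too, so this approach has to be used with care.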

But once more, there is no "extension" in a domain name, and even less a "double word extension".

Casimir et Hippolyte