5

I need to be able to identify the domain name of any subdomain.

Examples:

For all of these I need to match only example.co / example.com / example.org / example.co.uk / example.com.au / example.gov.us and so on:

www.example.co
www.first.example.co
first.example.co
second.first.example.co
no.matter.how.many.example.co
first.example.co.uk
second.first.example.co.uk
no.matter.how.many.example.co.uk
first.example.org
second.first.example.org
no.matter.how.many.example.org
first.example.gov.uk
second.first.example.gov.uk
no.matter.how.many.example.gov.uk

I have been playing with regular expressions and Googling all day long and still can't seem to find anything that works.

Edit 2: I would rather have a regex that fails for very odd cases like t.co than list all TLDs and have the ones I did not list (but could have predicted) fail and match more than they should. Isn't this the option you would choose?

Update: Using the chosen answer as a guide I have constructed this regex that does the job for me.

/([0-9a-z-]{2,}\.[0-9a-z-]{2,3}\.[0-9a-z-]{2,3}|[0-9a-z-]{2,}\.[0-9a-z-]{2,3})$/i

It might not be perfect but so far I have not encountered a case where it fails.
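For readers who want to try the pattern from the update, here is a minimal Python sketch. It assumes the `/.../i` literal translates to `re.IGNORECASE`; the helper name `match_domain` is made up for illustration and is not part of the question.

```python
import re

# The regex from the update above; the trailing /i becomes re.IGNORECASE.
DOMAIN_RE = re.compile(
    r"([0-9a-z-]{2,}\.[0-9a-z-]{2,3}\.[0-9a-z-]{2,3}"
    r"|[0-9a-z-]{2,}\.[0-9a-z-]{2,3})$",
    re.IGNORECASE,
)

def match_domain(host):
    """Return the trailing part of host that the regex matches, or None."""
    m = DOMAIN_RE.search(host)
    return m.group(1) if m else None

# A few of the sample hosts from the question.
for host in [
    "www.example.co",
    "no.matter.how.many.example.co",
    "second.first.example.co.uk",
    "first.example.org",
    "second.first.example.gov.uk",
]:
    print(host, "->", match_domain(host))
```

One case that does seem to slip through: when the label in front of a short TLD is itself only two or three characters long, the three-part alternative wins, so `www.bbc.com` comes back whole rather than as `bbc.com`.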

transilvlad

3 Answers

6

This will match:

([0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3}\.[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3})$

as long as:

  1. there are no extra spaces at the end of each line
  2. all domain codes used are short, two or three letters long. It will not work with longer domain codes like .info.

Basically, what it does is match either of these two:

  1. word two letters or longer:dot:two or three letters word:dot:two or three letters word:end of line
  2. word two letters or longer:dot:two or three letters word:end of line

Short version:

(\w{2,}\.\w{2,3}\.\w{2,3}|\w{2,}\.\w{2,3})$

If you want it to match only whole lines, add ^ at the beginning.
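The two conditions above can be checked with a small Python sketch. The names `LONG`, `SHORT` and `tail` are hypothetical, chosen only for this illustration. It also shows that the short version is not quite equivalent to the long one, because `\w` accepts underscores (and, in Python 3, non-ASCII letters) as well.

```python
import re

# Long version from this answer (explicit character classes).
LONG = re.compile(
    r"([0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3}\.[0-9A-Za-z]{2,3}"
    r"|[0-9A-Za-z]{2,}\.[0-9A-Za-z]{2,3})$"
)
# Short version from this answer (\w instead of the explicit classes).
SHORT = re.compile(r"(\w{2,}\.\w{2,3}\.\w{2,3}|\w{2,}\.\w{2,3})$")

def tail(pattern, line):
    """Return what pattern matches at the end of line, or None."""
    m = pattern.search(line)
    return m.group(1) if m else None

print(tail(LONG, "second.first.example.co.uk"))  # example.co.uk
print(tail(LONG, "first.example.co "))           # None: trailing space
print(tail(LONG, "first.example.info"))          # None: four-letter TLD
print(tail(LONG, "my_site.co.uk"))               # site.co.uk
print(tail(SHORT, "my_site.co.uk"))              # my_site.co.uk
```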

This is how I tested it:

[image: screenshot of the regex test]

Tulains Córdova
5

If you want an absolutely correct matcher, regular expressions are not the way to go.

Why?

  • Because both of these are valid domains + TLDs: goo.gl, t.co.

  • Because neither of these are (they're only TLDs): com.au, co.uk.

Any regex that you might create that would properly handle all of the above cases would simply amount to listing out the valid TLDs, which would defeat the purpose of using regular expressions in the first place.

Instead, just create or obtain a list of the current TLDs, see which one of them the hostname ends with, and then add the one segment that precedes it.
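A rough sketch of that list-based approach in Python follows. The `SUFFIXES` set here is a tiny, deliberately incomplete stand-in invented for this example; a real implementation would load a maintained list (the Public Suffix List is one such source). The function name `registered_domain` is also just an assumption for illustration.

```python
# Illustrative, incomplete suffix set (an assumption for this sketch only).
SUFFIXES = {
    "co", "com", "org", "gl", "uk",
    "co.uk", "gov.uk", "nhs.uk", "com.au",
}

def registered_domain(host, suffixes=SUFFIXES):
    """Return the known suffix plus the one label before it, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is preferred over plain "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is itself only a suffix, e.g. "co.uk"
            return ".".join(labels[i - 1:])
    return None  # suffix not in the list

print(registered_domain("second.first.example.co.uk"))  # example.co.uk
print(registered_domain("t.co"))                        # t.co
print(registered_domain("co.uk"))                       # None
```

Unlike the length-based regexes, this handles `t.co`, `goo.gl` and `nhs.uk` correctly, but only for suffixes that are actually in the list; anything missing returns `None`.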

Amber
  • That is what I have done so far, and TLDs that were new and unknown to me have failed. Example: .nhs.uk – transilvlad Oct 07 '12 at 20:17
  • Validating against a database of valid domains is outside the scope of the problem. A regular expression can do what he is asking for. – Tulains Córdova Oct 07 '12 at 20:18
  • 3
    @user1598390 No, actually, it can't - at least, not without making that regex become the database of valid TLDs. – Amber Oct 07 '12 at 20:18
  • @tntu - Any regex that is correct is going to fail in a similar manner, since both require listing out the valid TLDs to be correct. – Amber Oct 07 '12 at 20:19
  • I prefer a regex that might fail in some very unpredictable cases to listing all known TLDs and then encountering new ones. – transilvlad Oct 07 '12 at 20:20
  • The problem is that any such regex will fail in a number of very predictable cases, as I've illustrated above. :/ For example, all of the regexes proposed so far would have failed in your `nhs.uk` case. – Amber Oct 07 '12 at 20:22
  • @Amber, he/she doesn't want to check for correctness of domains. He/she just wants to extract domain-name-like patterns from that text sample. – Tulains Córdova Oct 07 '12 at 20:30
  • @Amber I know that either the regex or the code option would fail at some point, but a regex could be made to account for most cases, couldn't it? I would rather have a match that returns the complete address than no match at all. – transilvlad Oct 07 '12 at 20:30
  • Considering that the IETF will soon allow anyone to register new TLDs , this consideration is pretty much moot. – Philipp Oct 07 '12 at 20:33
  • TLDs are still valid. I don't know why you think they're not. For a while the `to` domain was running a url shortening service: `http://to./`, you just need to make it a `FQDN` to get there. – OmnipotentEntity Jun 27 '13 at 13:06
  • @Amber makes a very valid point. A regex to capture all the possibilities would be too difficult to create and maintain. How to proceed depends on what you need it for. You can catch most problems by not allowing invalid characters, requiring at least one dot, not allowing -- (except after `^xn--`). For full validation using a database might be best. Try [this one](http://en.wikipedia.org/wiki/Domain_Name_System). – fazy Feb 25 '15 at 16:50
0

Might this be of any use? It isolates the dotted host part of the string; after that it is a simple matter of splitting it on the dots.
[^/:"]*\.[^/:"]*
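As a small illustration of the "match, then split" idea, here is a Python sketch. The helper name `host_labels` and the sample URLs are assumptions made up for this example.

```python
import re

# The pattern from this answer: a run of characters containing a dot,
# with no slash, colon or double quote in it.
HOST_RE = re.compile(r'[^/:"]*\.[^/:"]*')

def host_labels(text):
    """Return the dot-separated labels of the first match, or an empty list."""
    m = HOST_RE.search(text)
    return m.group(0).split(".") if m else []

print(host_labels("http://www.first.example.co/path?x=1"))
# ['www', 'first', 'example', 'co']
```

Splitting only yields the labels; deciding how many trailing labels make up the domain still needs one of the approaches from the other answers.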

Sandman
Beezer