Can I improve this regex check for valid domain names?

Question

So, I have been working on this domain name regular expression. So far, it seems to pick up domain names with SLDs and TLDs (with the optional ccTLD), but there is duplication of the TLD listing. Can this be refactored any further?

params[:domain_name].downcase.strip.match(/^[a-z0-9\-]{2,63}
\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
(\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|
(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))?$/)

What is your use case for such a regex which needs to be maintained when new domains are created? — mark, Dec 30 '08 at 10:41
Since all of the answers seem to be giving other ways to lookup TLDs, I propose renaming this question to avoid duplication in the future (unless people actually start answering the refactoring question) — TheSoftwareJedi, Dec 30 '08 at 21:21

score 28 · Accepted Answer · edited Dec 30 '08 at 21:25


Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.

The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!

Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
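
As a rough sketch of what parsing that list involves, the Python below handles the three kinds of rule the list uses: plain rules, `*.` wildcard rules, and `!` exception rules. The names `parse_rules` and `public_suffix` and the handful of sample rules are illustrative assumptions, not part of any library; a real implementation would read the downloaded file instead. The published algorithm falls back to an implicit `*` rule when nothing matches, whereas this sketch returns None in that case, which is more convenient for validation.

```python
def parse_rules(lines):
    """Split suffix-list lines into (rules, exceptions); names are hypothetical."""
    rules, exceptions = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('//'):
            continue  # skip blank lines and // comments
        rule = line.split()[0].lower()  # a rule ends at the first whitespace
        if rule.startswith('!'):
            exceptions.add(rule[1:])
        else:
            rules.add(rule)
    return rules, exceptions


def public_suffix(domain, rules, exceptions):
    """Return the public suffix of domain, or None if no listed rule matches."""
    labels = domain.lower().strip('.').split('.')
    suffixes = ['.'.join(labels[i:]) for i in range(len(labels))]  # longest first
    # Exception rules take priority: the suffix is the rule minus its leftmost label.
    for suffix in suffixes:
        if suffix in exceptions:
            return suffix.partition('.')[2]
    # Otherwise the longest plain or wildcard match wins.
    for suffix in suffixes:
        rest = suffix.partition('.')[2]
        if suffix in rules or (rest and '*.' + rest in rules):
            return suffix
    return None


# Hand-picked sample in the list's format (assumption: not the real file).
SAMPLE = ['// sample rules', 'com', 'uk', 'co.uk', '*.ck', '!www.ck']
RULES, EXCEPTIONS = parse_rules(SAMPLE)

print(public_suffix('awesomedomain.co.uk', RULES, EXCEPTIONS))
print(public_suffix('www.ck', RULES, EXCEPTIONS))
print(public_suffix('omnom.nom', RULES, EXCEPTIONS))
```

Checking exceptions before anything else is what makes the `!` entries override a wildcard that would otherwise match the same name.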

edited Dec 30 '08 at 21:25

TheSoftwareJedi


answered Dec 30 '08 at 19:08

Alnitak


Concerning regular expressions and eye bleeding: http://www.codinghorror.com/blog/archives/001016.html – Gavin Miller Dec 30 '08 at 20:04
removed the code again - any noob can read a file from the net, and without the ! etc handling it's not useful. – Alnitak Dec 30 '08 at 21:21
I guess I agree. There are better ways to do it, but I need something that is incredibly to do registrations/transfers. Any other recommendations? – Josh Delsman Jan 06 '09 at 21:44
There is an opensource C# library that uses publicsuffix.org to parse domains, here: http://code.google.com/p/domainname-parser/ – Dan Esparza May 18 '09 at 05:28

score 4 · Answer 2 · answered Dec 30 '08 at 21:02


Download this list and save it as domains.txt: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Example usage (in Python):

import re

def validate(domain):
    # Build the TLD alternation from the IANA list, skipping '#' comment lines.
    with open('domains.txt') as f:
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))


print(validate('stackoverflow.com'))
print(validate('omnom.nom'))

You can factor the domain-list-building out of the validate function to help performance.
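
One way that factoring might look is sketched below: the pattern is compiled once at import time and `validate` only runs the match. The helper name `build_pattern` and the inline `SAMPLE_LINES` are assumptions made to keep the example self-contained; in practice the lines would come from the downloaded file.

```python
import re


def build_pattern(lines):
    """Compile one regex from TLD-list lines, ignoring blanks and '#' comments."""
    tlds = [re.escape(line.strip().upper())
            for line in lines
            if line.strip() and not line.startswith('#')]
    return re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(tlds))


# Stand-in for the file contents (assumption: a tiny sample, not the real list).
SAMPLE_LINES = ['# sample header line', 'COM', 'NET', 'ORG', 'UK']

# Built once; every call to validate() reuses the compiled pattern.
DOMAIN_RE = build_pattern(SAMPLE_LINES)


def validate(domain):
    return bool(DOMAIN_RE.match(domain.strip().upper()))


print(validate('stackoverflow.com'))
print(validate('omnom.nom'))
```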

answered Dec 30 '08 at 21:02

Steve Losh


Results aren't as expected for domains like awesomedomain.co.uk -- the TLD isn't considered .uk, it's .co.uk. It's better to use something like http://publicsuffix.org/ – Dan Esparza May 11 '09 at 22:36
@DanEsparza: And yet, [publicsuffix.org](http://publicsuffix.org/) records it as "*.uk" and _not_ as "co.uk". – Dennis Williamson Nov 11 '11 at 20:54
@DennisWilliamson the `*` in the entry for `*.uk` means that every _sub-domain_ of `.uk` is public _except for the ones explicitly listed_. – Alnitak Oct 29 '12 at 15:34

score 0 · Answer 3 · answered Dec 30 '08 at 10:34


I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? It seems that the domain name is "info.com" in that particular case.

And you might want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash?
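
A minimal sketch of such a check, in Python, is below. It goes slightly beyond the suggestion above by also rejecting a trailing dash, and it allows labels of 1 to 63 characters rather than the 2 to 63 used in the question; `LABEL_RE` and `valid_label` are hypothetical names.

```python
import re

# One label: starts and ends with a letter or digit, dashes only in between.
# The {0,61} middle part caps the whole label at 63 characters.
LABEL_RE = re.compile(r'[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?')


def valid_label(label):
    """True if label is a plausible single domain-name label."""
    return LABEL_RE.fullmatch(label.lower()) is not None


print(valid_label('stackoverflow'))
print(valid_label('-dash'))
```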

answered Dec 30 '08 at 10:34

PEZ


Not all domain names are two-part. A single-part example: "ck" is the domain for the Cook Islands (try http://ck or http://www.ck); my own domain is three-part (nichesoftware.co.nz) due to a structure within the .nz TLD. – Bevan Dec 30 '08 at 21:10

score -1 · Answer 4 · edited Dec 30 '08 at 11:28

Well, as you have it written, the TLD part is equivalent to, but longer than, (\.<tldpart>){1,2}, but I'm sure it could be fixed for duplication...

edit: yech, no, it would be possible but essentially a very slow brute force list to handle the duplications I think. Simpler and faster to put the possible TLD and SLD+country pairs in a big hashmap and check the substring against that.
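
The lookup idea could be sketched roughly as follows in Python, using a set in place of a hashmap. `KNOWN_SUFFIXES` holds only a few illustrative entries and `split_domain` is a hypothetical helper; a real table would be loaded from a maintained list.

```python
# Illustrative sample only (assumption): single TLDs plus SLD+country pairs.
KNOWN_SUFFIXES = {'com', 'net', 'org', 'uk', 'co.uk', 'nz', 'co.nz'}


def split_domain(domain):
    """Return (name, suffix) using the longest known suffix, or None."""
    labels = domain.lower().strip().split('.')
    # Try the two-label suffix first so 'co.uk' wins over plain 'uk'.
    for n in (2, 1):
        suffix = '.'.join(labels[-n:])
        if len(labels) > n and suffix in KNOWN_SUFFIXES:
            return '.'.join(labels[:-n]), suffix
    return None


print(split_domain('nichesoftware.co.nz'))
print(split_domain('omnom.nom'))
```

Each check is a constant-time set lookup, so the cost does not grow with the number of suffixes the way a long regex alternation does.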

edited Dec 30 '08 at 11:28

answered Dec 30 '08 at 10:36

annakata


score -1 · Answer 5 · edited May 27 '09 at 17:37

You can build up the regex as a string and then do Regexp.new(string).
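
The same idea in Python, where `re.compile` plays the role of `Regexp.new`, might look like the sketch below. The TLD alternation is written once and reused for both the TLD and the optional ccTLD position, which removes the duplication the question asks about. The shortened `TLD` list and the names `LABEL`, `DOMAIN_RE`, and `matches` are illustrative assumptions.

```python
import re

# Shortened, illustrative alternation (assumption); the full list would go here once.
TLD = r'(?:com|net|org|co|uk|nz)'
LABEL = r'[a-z0-9\-]{2,63}'

# The TLD fragment is interpolated twice instead of being written out twice.
DOMAIN_RE = re.compile(r'%s\.%s(?:\.%s)?' % (LABEL, TLD, TLD))


def matches(domain):
    """True if domain looks like name.tld or name.tld.cctld."""
    return DOMAIN_RE.fullmatch(domain.strip().lower()) is not None


print(matches('example.co.uk'))
print(matches('example.invalid'))
```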

edited May 27 '09 at 17:37

answered Dec 30 '08 at 10:38

Jules


score -1 · Answer 6 · answered Dec 30 '08 at 19:55


I'd recommend starting with the rules laid out in RFC 1035, and then working backwards -- but only if you really really really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out the site regexlib.com and browse through what other folks have done.

answered Dec 30 '08 at 19:55

sammich


The RFC technically does not allow all-numeric domain parts, but in practice registrars and nameservers have been allowing them for years now. – nobody Dec 30 '08 at 21:52
