7

I have been working on this domain name regular expression. So far, it seems to match domain names with SLDs and TLDs (with an optional ccTLD), but the TLD list is duplicated. Can this be refactored any further?

params[:domain_name].downcase.strip.match(/^[a-z0-9\-]{2,63}
\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
(\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|
(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))?$/)
Alnitak
Josh Delsman

6 Answers

28

Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.

The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!

Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
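As a rough illustration of the parsing step, here is a minimal Python sketch. It assumes the list's published format: `//` starts a comment, a rule ends at the first whitespace, `*.` marks a wildcard rule and `!` marks an exception. The function name `parse_suffix_list` and the `SAMPLE` text are made up for this example, and the sample is not the real list.

```python
def parse_suffix_list(text):
    """Split Public Suffix List text into exact, wildcard and exception rules."""
    exact, wildcard, exception = set(), set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '//' comments.
        if not line or line.startswith('//'):
            continue
        # A rule ends at the first whitespace on the line.
        rule = line.split()[0].lower()
        if rule.startswith('!'):
            exception.add(rule[1:])   # '!www.ck' -> 'www.ck'
        elif rule.startswith('*.'):
            wildcard.add(rule[2:])    # '*.ck' -> 'ck'
        else:
            exact.add(rule)
    return exact, wildcard, exception


# Hypothetical excerpt for illustration only, not the real list.
SAMPLE = """\
// illustrative excerpt
com
uk
co.uk
*.ck
!www.ck
"""

exact, wildcard, exception = parse_suffix_list(SAMPLE)
```

In real use the text would come from the downloaded file rather than a string literal, and the list should be refreshed periodically since it changes.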

TheSoftwareJedi
Alnitak
  • Concerning regular expressions and eye bleeding: http://www.codinghorror.com/blog/archives/001016.html – Gavin Miller Dec 30 '08 at 20:04
  • removed the code again - any noob can read a file from the net, and without the ! etc handling it's not useful. – Alnitak Dec 30 '08 at 21:21
  • I guess I agree. There are better ways to do it, but I need something that is incredibly to do registrations/transfers. Any other recommendations? – Josh Delsman Jan 06 '09 at 21:44
  • There is an opensource C# library that uses publicsuffix.org to parse domains, here: http://code.google.com/p/domainname-parser/ – Dan Esparza May 18 '09 at 05:28
4

Download this: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Example usage (in Python):

import re

def validate(domain):
    # Read the IANA list, skipping the '#' header comment and blank lines.
    with open('domains.txt') as f:
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    # One label of 2-63 characters followed by a known TLD.
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))


print(validate('stackoverflow.com'))
print(validate('omnom.nom'))

You can factor the domain-list-building out of the validate function to help performance.
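One way that factoring could look, as a sketch only: compile the pattern once and hand back a function that reuses it. The name `build_validator` and the `sample_lines` data are hypothetical; the sample stands in for the contents of the downloaded file.

```python
import re

def build_validator(tld_lines):
    """Compile the TLD regex once and return a function that reuses it."""
    tlds = [re.escape(line.strip().upper())
            for line in tld_lines
            if line.strip() and not line.startswith('#')]
    pattern = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(tlds))

    def validate(domain):
        # Only the match runs per call; the list is not re-read.
        return bool(pattern.match(domain.upper()))
    return validate


# Hypothetical stand-in for the lines of the downloaded file.
sample_lines = ['# illustrative header\n', 'COM\n', 'NET\n', 'ORG\n']
validate = build_validator(sample_lines)
```

With the real file, something like `build_validator(open('domains.txt'))` at startup would do the expensive part once.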

Steve Losh
  • 2
    Results aren't as expected for domains like awesomedomain.co.uk: the TLD isn't considered .uk, it's .co.uk. It's better to use something like http://publicsuffix.org/ – Dan Esparza May 11 '09 at 22:36
  • @DanEsparza: And yet, [publicsuffix.org](http://publicsuffix.org/) records it as "*.uk" and _not_ as "co.uk". – Dennis Williamson Nov 11 '11 at 20:54
  • @DennisWilliamson the `*` in the entry for `*.uk` means that every _sub-domain_ of `.uk` is public _except for the ones explicitly listed_. – Alnitak Oct 29 '12 at 15:34
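To make the wildcard and exception handling discussed in these comments concrete, here is a small Python sketch of the lookup. It assumes the rules have already been split into three sets (exact rules, wildcard rules stored without the `*.`, exception rules stored without the `!`). The function name `public_suffix` and the sample rules are illustrative, not a snapshot of the real list.

```python
def public_suffix(domain, exact, wildcard, exception):
    """Return the public suffix of `domain` under the given rule sets."""
    labels = domain.lower().strip('.').split('.')
    # Candidate suffixes, longest first: a.b.c -> ['a.b.c', 'b.c', 'c'].
    candidates = ['.'.join(labels[i:]) for i in range(len(labels))]

    # Exception rules win: the suffix is the rule minus its leftmost label.
    for c in candidates:
        if c in exception:
            return c.partition('.')[2]

    # Otherwise take the longest exact or wildcard match.
    for c in candidates:
        parent = c.partition('.')[2]
        if c in exact or (parent and parent in wildcard):
            return c

    # Nothing matched: fall back to the last label.
    return labels[-1]


# Illustrative rules only.
EXACT = {'com', 'uk', 'co.uk'}
WILDCARD = {'ck'}        # from a rule written as '*.ck'
EXCEPTION = {'www.ck'}   # from a rule written as '!www.ck'
```

Under these sample rules, `awesomedomain.co.uk` resolves to `co.uk`, `foo.bar.ck` to `bar.ck` (the wildcard), and `www.ck` to `ck` (the exception).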
0

I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? It seems that the domain name is "info.com" in that particular case.

And you might want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash?

PEZ
  • 1
    Not all domain names are two part. A single part example: "ck" is the domain for the Cook islands (try http://ck or http://www.ck); my own domain is three part (nichesoftware.co.nz) due to a structure within the .nz TLD. – Bevan Dec 30 '08 at 21:10
-1

Well, as you have it written, the TLD part is equivalent to, but longer than, (\.<tldpart>){1,2}, but I'm sure it could be fixed for duplication...

edit: yech, no, it would be possible but essentially a very slow brute force list to handle the duplications I think. Simpler and faster to put the possible TLD and SLD+country pairs in a big hashmap and check the substring against that.
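A minimal Python sketch of that lookup idea, using a set in place of a hashmap since only membership matters. `KNOWN_SUFFIXES` is a tiny illustrative subset and `split_domain` is a made-up name; a real table would be loaded from a maintained list.

```python
# Illustrative subset only; not a complete or current table.
KNOWN_SUFFIXES = {'com', 'net', 'org', 'uk', 'co.uk', 'nz', 'co.nz'}

def split_domain(name):
    """Return (label, suffix) if `name` is one label plus a known suffix, else None."""
    name = name.lower().strip()
    # Everything after the first dot is checked as a whole against the table.
    label, _, suffix = name.partition('.')
    if label and suffix in KNOWN_SUFFIXES:
        return label, suffix
    return None
```

Because the whole substring after the first dot is looked up, `foo.info.com` is rejected here (there is no `info.com` entry), while `nichesoftware.co.nz` is accepted.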

annakata
-1

You can build up the regex as a string and then do Regexp.new(string).
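The same idea sketched in Python, where `re.compile` plays the role of Ruby's `Regexp.new`: the TLD alternation is written once as a string and spliced into the pattern. The `TLD` fragment below is a short illustrative subset, not the full list from the question, and `matches` is a made-up helper name.

```python
import re

# Illustrative subset; the full alternation from the question would go here.
TLD = r'(?:com|net|org|co|uk|nz)'

# One label, then one or two TLD parts. The TLD fragment appears only once.
DOMAIN_RE = re.compile(r'^[a-z0-9\-]{2,63}(?:\.' + TLD + r'){1,2}$')

def matches(domain):
    """True if `domain` is a label followed by one or two known TLD parts."""
    return DOMAIN_RE.match(domain.lower().strip()) is not None
```

This keeps the behaviour of the duplicated pattern (an optional second TLD part) while leaving a single place to edit when the list changes.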

Jules
-1

I'd recommend starting with the rules laid out in RFC 1035 and then working backwards, but only if you really, really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out regexlib.com and browse through what other folks have done.
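For reference, a sketch of what the RFC 1035 label rules look like as code, in Python. It assumes the grammar from section 2.3.1: a label starts with a letter, ends with a letter or digit, contains only letters, digits and hyphens, and is at most 63 characters. The 253-character cap is the usual text-form consequence of the 255-octet name limit. The `strict` flag and the relaxed pattern are additions for this example, since registrars do accept labels that start with a digit.

```python
import re

# Strict RFC 1035 label: letter first, letter or digit last, 63 chars max.
STRICT_LABEL = re.compile(r'[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?')
# Relaxed variant (an assumption for this sketch): a leading digit is allowed.
RELAXED_LABEL = re.compile(r'[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?')

def is_valid_name(name, strict=True):
    """Check every label of `name` plus the overall length limit."""
    name = name.lower().rstrip('.')
    if not name or len(name) > 253:
        return False
    label_re = STRICT_LABEL if strict else RELAXED_LABEL
    # fullmatch ensures the whole label fits the pattern, not just a prefix.
    return all(label_re.fullmatch(label) for label in name.split('.'))
```

This only checks syntax. It says nothing about whether the TLD exists, so it complements rather than replaces a lookup against a maintained list.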

sammich
  • The RFC technically does not allow all-numeric domain parts, but in practice registrars and nameservers have been allowing them for years now. – nobody Dec 30 '08 at 21:52