11

I am using Python and would like a simple API or regex to check a domain name's validity. By validity I mean syntactic validity, not whether the domain name actually exists on the Internet.

demos
  • For what reason? If this is, let's say, e-mail, the real validity should be checked by doing a DNS query for the MX record, not by regexp. – Kimvais May 24 '10 at 05:33
  • 5
    Nope. There is zero benefit in doing lookups for known invalid names, it's just a waste of time and resources. Also you don't need an MX record to deliver email, an A record is sufficient. – Synchro Mar 22 '12 at 09:50
  • Seems it is already discussed [HERE](http://stackoverflow.com/questions/1128168/validation-for-url-domain-using-regex-rails). – Incognito May 24 '10 at 05:28

5 Answers

16

Any domain name is (syntactically) valid if it's a dot-separated list of identifiers, each no longer than 63 characters, and made up of letters, digits and dashes (no underscores).

So:

r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*'

would be a start. Of course, these days some non-ASCII characters may be allowed (a very recent development), which changes the parameters a lot -- do you need to deal with that?
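
A minimal sketch of how a pattern like the one above might be applied in Python. The names `LIBERAL_DOMAIN_RE` and `looks_like_domain` are illustrative and not part of the answer. Two assumptions are made here: `{1,63}` is used instead of `{,63}` so that empty labels are rejected, and non-ASCII names are first converted with Python's built-in `idna` codec (which implements the older IDNA 2003 rules).

```python
import re

# The answer's pattern with {1,63} instead of {,63}, so empty labels are
# rejected (an assumption; the original deliberately errs on the liberal side).
LIBERAL_DOMAIN_RE = re.compile(r'[a-zA-Z\d-]{1,63}(\.[a-zA-Z\d-]{1,63})*')

def looks_like_domain(name):
    """Return True if name is a dot-separated list of letter/digit/dash labels."""
    try:
        # The built-in "idna" codec maps non-ASCII labels to their ASCII form
        # and raises UnicodeError for empty or over-long labels.
        ascii_name = name.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    # fullmatch requires the whole string to match, not just a prefix.
    return LIBERAL_DOMAIN_RE.fullmatch(ascii_name) is not None
```

Because the pattern stays liberal, a label such as `-example` is still accepted; only characters outside letters, digits and dashes are rejected.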

Alex Martelli
  • 1
    can an identifier start/end with a hyphen? – Amarghosh May 24 '10 at 05:32
  • Thanks! No, I don't. I only need a basic sanity check to ensure that the name does not contain any blacklisted characters such as ' ! " etc. – demos May 24 '10 at 07:06
  • Alex, I know you are an appengine Guru, please help me with this one: http://stackoverflow.com/questions/2894808/creating-auto-incrementing-column-in-google-appengine Thanks in advance! – demos May 24 '10 at 07:07
  • 2
    @Amarghosh, per RFC 1035, yes: but the RFC also says "when assigning a domain name for an object, the prudent user will select a name" that's more prudent than that (and in particular has each identifier, which it calls 'label', start with a letter, and the whole domain name limited to 255 bytes). "Be conservative in what you generate and liberal in what you accept"!-) Since a RE no doubt has to do with "accept", better it be liberal. – Alex Martelli May 24 '10 at 14:45
  • @demos, I see you got a good answer to that other question (I was asleep by the time you asked it;-). – Alex Martelli May 24 '10 at 14:46
  • @alex yup :) I have 2 more for you: http://stackoverflow.com/questions/2906908/searching-through-model-relationships-in-google-app-engine http://stackoverflow.com/questions/2906746/updating-model-schema-in-google-app-engine Thanks! – demos May 25 '10 at 17:27
  • You got two perfectly correct answers to those two questions, too (even though you apparently don't like them, I can't add anything to those answers). – Alex Martelli May 26 '10 at 02:34
  • 1
    ! is not necessarily 'blacklisted'. RFC2872 says that labels that are not used as hostnames (i.e. which do not map to an IP, for example in TXT or SRV records) may contain any printable ASCII character, so _,;:'"!@£~$ and friends are all up for inclusion. This doc is good: http://domainkeys.sourceforge.net/underscore.html – Synchro Mar 22 '12 at 09:47
  • @Synchro – your point is valid, but does it really apply to this question? It still seems that a valid domain name doesn't allow any characters other than the 'LDH' characters [[Domain Name System – Wikipedia](http://en.wikipedia.org/wiki/DNS_label#Domain_name_syntax)]. – Kenny Evitt Aug 02 '12 at 22:08
  • 1
    The question doesn't specify a context so I think it's not unreasonable to include - I use non-LDH names quite often for looking up DKIM keys, which use names like `blah._domainkey.example.com`. – Synchro Sep 03 '12 at 14:34
  • This is a very good generic regex, but it must be noted that a domain name cannot begin with a dash. – Neil C. Obremski Jan 16 '15 at 17:46
6
r'^(?=.{4,255}$)([a-zA-Z0-9]([a-zA-Z0-9-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9]{2,6}$'
  • The lookahead makes sure that the name has a minimum of 4 (a.in) and a maximum of 255 characters.
  • One or more labels (separated by periods), each between 1 and 63 characters long, starting and ending with an alphanumeric character and containing only alphanumeric characters and hyphens in the middle.
  • Followed by a top-level domain name (whose maximum length is 6, for museum).
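
The rules in the list above can also be checked label by label without a regex, which may be easier to read and adjust. This is only a sketch: the function name `is_valid_hostname` and its `max_length` parameter are made up for illustration, and the top-level label is only required to be at least two alphanumeric characters, with no upper bound other than the 63-character label limit (an assumption, since longer TLDs exist today).

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
LDH = ALNUM | {"-"}  # letters, digits, hyphen

def is_valid_hostname(name, max_length=255):
    """Check overall length, label count and per-label rules without a regex."""
    if not 4 <= len(name) <= max_length:
        return False
    labels = name.split(".")
    if len(labels) < 2:  # at least one label plus a top-level domain
        return False
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        # Labels must start and end with a letter or digit.
        if label[0] not in ALNUM or label[-1] not in ALNUM:
            return False
        if not set(label) <= LDH:
            return False
    tld = labels[-1]
    # Assumed TLD rule: two or more alphanumeric characters, no hyphen.
    return len(tld) >= 2 and set(tld) <= ALNUM
```

For example, `is_valid_hostname("a.in")` is accepted, while `is_valid_hostname("-bad.example.com")` is rejected because the first label starts with a hyphen.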
Amarghosh
3

Note that while you can do something with regular expressions, the most reliable way to test for valid domain names is to actually try to resolve the name (with socket.getaddrinfo):

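from socket import getaddrinfo

# Returns a list of (family, type, proto, canonname, sockaddr) tuples;
# raises socket.gaierror if the name cannot be resolved.
result = getaddrinfo("www.google.com", None)
print(result[0][4])  # sockaddr of the first result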

Note that technically this can leave you open to DoS (if someone submits thousands of invalid domain names, it can take a while to resolve invalid names) but you could simply rate-limit someone who tries this.

The advantage of this is that it will catch "hotmail.con" (a typo for "hotmail.com", say) as invalid, whereas a regex would accept "hotmail.con" as valid.
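
A hedged sketch of how the lookup might be wrapped so that a failed resolution becomes a simple True/False answer. The function name `resolves` and its `resolver` parameter are hypothetical; the parameter exists only so the lookup can be swapped out (for example in tests) and defaults to `socket.getaddrinfo`.

```python
import socket

def resolves(name, resolver=socket.getaddrinfo):
    """Return True if the resolver finds at least one address for name."""
    try:
        return len(resolver(name, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed.
        # UnicodeError: the name could not be IDNA-encoded (e.g. an empty label).
        return False
```

Calling `resolves("www.google.com")` performs a real DNS lookup, so the rate-limiting concern mentioned above still applies.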

Dean Harding
  • 3
    This is really a separate problem and not a good answer to the question. Given that DNS has been used for exploits in the past, checking that a string is at least vaguely valid before using it is only sensible, plus it's orders of magnitude faster than a DNS lookup. This is akin to running code to see if it's malicious! – Synchro Mar 22 '12 at 09:45
  • This can not be used for validating domain names that are about to be created, only for already existing ones. – nerdoc Jul 23 '15 at 13:08
  • 1
    @MichaelSmith if you're still wondering nearly a year later, it's because you can't do a DNS lookup on a URL like that - DNS is only for the domain name, so it gets confused by the extra protocol gubbins there. – Xyon Aug 16 '17 at 07:29
0

I've been using this:

(r'(\.|\/)(([A-Za-z\d]+|[A-Za-z\d][-])+[A-Za-z\d]+){1,63}\.([A-Za-z]{2,3}\.[A-Za-z]{2}|[A-Za-z]{2,6})')

to ensure that the name follows either a dot (www.) or a slash (http://), that a dash occurs only inside the name, and that suffixes such as gov.uk match too.
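
If the input is a full URL rather than a bare name, a different technique from the regex above is to let the standard library pull out the host first and validate that separately. A small sketch, assuming `urllib.parse.urlsplit` is acceptable for the inputs involved; `host_from_url` is an illustrative name.

```python
from urllib.parse import urlsplit

def host_from_url(url):
    """Return the lower-cased host part of url, or None if there is none."""
    # urlsplit only recognises the host when the URL has a "//" authority
    # part, so bare names such as "www.example.com" are given one first.
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname
```

The returned host can then be passed to whichever syntax check is preferred.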

michalu
0

The answers are all pretty outdated with respect to the spec at this point. I believe the pattern below matches the current spec correctly:

r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
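
A quick way to probe what this pattern accepts is to compile it and try a few names. The sketch below copies the pattern unchanged; `MODERN_RE` and `matches_modern` are illustrative names. It uses `fullmatch` rather than `match` because `$` on its own also matches just before a trailing newline.

```python
import re

MODERN_RE = re.compile(
    r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
)

def matches_modern(name):
    """Return True if the whole of name matches the pattern above."""
    return MODERN_RE.fullmatch(name) is not None

for sample in ["example.com", "localhost", "a..b", ".example.com", "-example.com"]:
    print(sample, matches_modern(sample))
```

Note that the pattern places no restriction on hyphens at the start or end of a label, so a name such as `-example.com` is accepted.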
Sergey Kalinichenko