39

Is there a quick and dirty way to validate if the correct FQDN has been entered? Keep in mind there is no DNS server or Internet connection, so validation has to be done via regex/awk/sed.

Any ideas?

Jonathan Leffler
Riaan
  • Not really.. At least, it won't be reliable. You can check whether TLD part is valid by keeping a list of your own TLDs (which will need to be kept up-to-date) but other than that I guess you're out of luck :) – favoretti Aug 04 '12 at 15:12
  • 1
    Try this, it's a regex: http://stackoverflow.com/questions/4912520/validate-fqdn-in-c-sharp – tombolinux Aug 04 '12 at 16:31
  • well my idea was to verify that the user has entered a valid dns name e.g groupa-zone1appserver.example.com as to a standard. – Riaan Aug 04 '12 at 16:38
  • http://www.ietf.org/rfc/rfc2181.txt section 11. They don't have to be ascii. – pizza Aug 05 '12 at 00:41

6 Answers

65
(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)

Regex is always going to be at best an approximation for things like this, and the rules change over time. The regex above was written with the following in mind and is specific to hostnames.

Hostnames are composed of a series of labels concatenated with dots. Each label is 1 to 63 characters long, and may contain:

  • the ASCII letters a-z (in a case insensitive manner),
  • the digits 0-9,
  • and the hyphen ('-').

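Additionally, as encoded in the regex above, a label may neither start nor end with a hyphen, and the whole name must be between 4 and 253 characters long.

Some assumptions: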

  • TLD is at least 2 characters and only a-z
  • we want at least 1 level above TLD

results: valid / invalid

  • 911.gov - valid
  • 911 - invalid (no TLD)
  • a-.com - invalid
  • -a.com - invalid
  • a.com - valid
  • a.66 - invalid
  • my_host.com - invalid (underscore)
  • typical-hostname33.whatever.co.uk - valid
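For those asking how to actually use the pattern, here is a minimal Python sketch. It assumes Python's `re` module, which supports both the lookahead and the fixed-width lookbehind used here. The function name `is_valid_fqdn` and the sample list are illustrative and not part of the original answer.

```python
import re

# Pattern copied verbatim from the answer above.
FQDN_RE = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)


def is_valid_fqdn(name: str) -> bool:
    """Return True if name matches the hostname pattern."""
    # fullmatch() also rejects a trailing newline, which a bare `$` tolerates.
    return FQDN_RE.fullmatch(name) is not None


SAMPLES = [
    "911.gov",
    "911",
    "a-.com",
    "-a.com",
    "a.com",
    "a.66",
    "my_host.com",
    "typical-hostname33.whatever.co.uk",
]

for sample in SAMPLES:
    verdict = "valid" if is_valid_fqdn(sample) else "invalid"
    print(f"{sample} - {verdict}")
```

The same pattern string should work in any engine with lookbehind support. Engines without lookaround need a different approach, such as splitting on dots and checking each label.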

EDIT: John Rix provided an alternative hack of the regex to make the specification of a TLD optional:

(?=^.{1,253}$)(^(((?!-)[a-zA-Z0-9-]{1,63}(?<!-))|((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63})$)
  • 911 - valid
  • 911.gov - valid

EDIT 2: Someone asked for a version that works in JS. The reason it didn't work in JS is that JS engines at the time did not support regex lookbehind, specifically the code (?<!-), which specifies that the previous character cannot be a hyphen.

Anyway, here it is rewritten without the lookbehind; a little uglier, but not much:

(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)

You could likewise make a similar replacement in John Rix's version.
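As a sanity check that the lookbehind-free rewrite accepts the same names, here is a small Python sketch comparing the two patterns on a handful of inputs. The sample list and the helper name `agree` are made up for this example, and a few samples are of course not a proof of equivalence.

```python
import re

# Original pattern, using the (?<!-) lookbehind.
WITH_LOOKBEHIND = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)
# Rewrite from EDIT 2: the last character of each label must be alphanumeric.
NO_LOOKBEHIND = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)"
)


def agree(name: str) -> bool:
    """Return True if both patterns give the same verdict for name."""
    first = WITH_LOOKBEHIND.fullmatch(name) is not None
    second = NO_LOOKBEHIND.fullmatch(name) is not None
    return first == second


SAMPLES = [
    "911.gov", "911", "a-.com", "-a.com", "a.com", "a.66",
    "my_host.com", "typical-hostname33.whatever.co.uk",
    "a" * 63 + ".com",  # longest allowed label
    "a" * 64 + ".com",  # one character too long
]

for sample in SAMPLES:
    print(sample[:40], "agree" if agree(sample) else "DISAGREE")
```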

EDIT 3: If you want to allow a trailing dot, which is technically allowed:

(?=^.{4,253}\.?$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)

I wasn't familiar with trailing-dot syntax until @ChaimKut pointed it out and I did some research.

Using trailing dots, however, seems to cause somewhat unpredictable results in the various tools I played with, so I would advise some caution.
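To illustrate how the trailing-dot variant treats the 253-character limit, here is a hedged Python sketch. The helper name `allows_trailing_dot` and the generated test names are assumptions made for the example.

```python
import re

# EDIT 3 pattern: one optional trailing dot, outside the 253-character limit.
TRAILING_DOT_RE = re.compile(
    r"(?=^.{4,253}\.?$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)"
)


def allows_trailing_dot(name: str) -> bool:
    """Return True if name is accepted, with or without one trailing dot."""
    return TRAILING_DOT_RE.fullmatch(name) is not None


# Build a name of exactly 253 characters: labels of 63, 63, 63, 57 and 3.
name_253 = ".".join(["a" * 63] * 3 + ["a" * 57, "com"])

print(allows_trailing_dot("example.com"))    # plain name
print(allows_trailing_dot("example.com."))   # single trailing dot
print(allows_trailing_dot("example.com.."))  # two trailing dots
print(allows_trailing_dot(name_253 + "."))   # 253 characters plus the dot
```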

bkr
  • 1
    Here's a (somewhat hacky) alternative version that would also validate a hostname without associated domain. Any improvements? `(?=^.{1,254}$)(^(((?!-)[a-zA-Z0-9-]{1,63}(?<!-))|((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63})$)` – John Rix Jun 30 '14 at 16:07
  • @John Rix: your regex looks like it works but many people copy/pasting it will find it fails since StackExchange inserts invisible characters into the html source of comments for formatting purposes- look at the HTML source and see http://meta.stackexchange.com/questions/170970/occasionally-the-unicode-character-sequence-u200c-u200b-zwnj-zwsp-is-insert – bkr Jun 30 '14 at 19:42
  • Thanks @bkr, wasn't aware of that. Doesn't look like there's a solution, but at least you've exposed this trap for the uninitiated here! – John Rix Jul 01 '14 at 08:25
  • 1
    Can someone provide a Javascript version of this regex? – T Nguyen Sep 22 '14 at 21:35
  • @T Nguyen : see Edit 2 – bkr Sep 23 '14 at 05:57
  • 1
    You need to allow for a trailing dot. See http://en.wikipedia.org/wiki/Fully_qualified_domain_name – ChaimKut Nov 19 '14 at 12:06
  • 1
    hmmm. you are technically correct. I also learned you can only have 253 ascii characters not counting the trailing . – bkr Nov 19 '14 at 20:59
  • How do I go about using this regex in an if statement in bash? – maxisme Jun 17 '15 at 18:50
  • 1
    This doesn't account for punycode in TLDs and counts any trailing dot in with the 253 limit. – Martijn May 30 '18 at 07:14
  • how to use these regex? – Yakob Ubaidi Apr 04 '19 at 09:41
  • @Martijn - I modified Edit3 such that the trailing period is outside the 253 char limit. You are correct that it does not support punycode – bkr Oct 28 '19 at 19:51
  • John Rix (optional TLD) version with no lookback for Javascript support: `(?=^.{1,253}$)(^(((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9])|((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63})$)` – Nicholas Betsworth Jul 03 '20 at 14:39
  • Golang doesn't support lookaheads, lookbehinds. Can you give a version without the use of these? – subtleseeker Jun 17 '22 at 06:00
20

It's harder nowadays, with internationalized domain names and several thousand (!) new TLDs.

The easy part is that you can still split the components on ".".

You need a list of registerable TLDs. There's a site for that:

https://publicsuffix.org/list/effective_tld_names.dat

You only need to check the ICANN-recognized ones. Note that a registerable TLD can have more than one component, such as "co.uk".

Then there's IDN and punycode. Domains are Unicode now. For example,

"xn--nnx388a" is equivalent to "臺灣". Both of those are valid TLDs, incidentally.

For punycode conversion code, see "http://golang.org/src/pkg/net/http/cookiejar/punycode.go".
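As a rough illustration of the split-and-convert idea in Python, here is a sketch that uses the standard library's built-in `idna` codec. One assumption to be aware of: that codec implements the older IDNA 2003 rules, so treat it as an approximation of current IDN handling. The helper names are made up for this example.

```python
from typing import List


def to_a_labels(name: str) -> List[str]:
    """Split a domain name on dots and convert each label to its ASCII form."""
    # The idna codec leaves plain ASCII labels unchanged and turns Unicode
    # labels into their "xn--" punycode form. It raises UnicodeError for
    # labels it cannot encode, for example ones that are too long.
    return [label.encode("idna").decode("ascii") for label in name.split(".")]


def is_punycode_label(label: str) -> bool:
    """Return True if an ASCII label is a punycode-encoded Unicode label."""
    return label.lower().startswith("xn--")


print(to_a_labels("www.example.com"))
print(to_a_labels("臺灣"))
print(b"xn--nnx388a".decode("idna"))
```

The resulting ASCII labels can then be checked with the old hostname syntax and looked up in a suffix list.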

Checking the syntax of each domain component has new rules, too. See RFC5890 at https://www.rfc-editor.org/rfc/rfc5890

Components can be either A-labels (ASCII only) or Unicode. ASCII labels either follow the old syntax, or begin "xn--", in which case they are a punycode version of a Unicode string.

The rules for Unicode are very complex, and are given in RFC5890. The rules are designed to prevent such things as mixing characters from left-to-right and right-to-left sets.

Sorry there's no easy answer.

Community
John Nagle
  • 1
    If the validation should work on any network, don't assume FQDNs must end with an official TLD. Internal networks might have any TLD as long as it resolves internally. A classic example is the `.company` internal TLD. – Marcos Dione Jul 08 '20 at 08:45
7

This regex is what you want:

(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)

It matches your example domain (groupa-zone1appserver.example.com, cod.eu, etc.).

I'll try to explain:

(?=^.{1,254}$) matches domain names (which can begin with any character) that are between 1 and 254 characters long; it could also be 5,254 if we assume co.uk is the minimum length.

(^ starts the match

(?: opens a non-capturing group

(?!\d+\.) a domain label should not consist only of digits, so 1234.co.uk and abc.123.uk are not accepted, while 1a.ko.uk is.

[a-zA-Z0-9_\-] a domain label may contain only the characters a-zA-Z0-9_-

{1,63} any domain level is at most 63 characters long (it could be 2,63)

\.? an optional dot after each label

+ repeats the group one or more times

(?:[a-zA-Z]{2,})$) the final part of the domain name must not be followed by anything else and must consist of at least 2 characters from a-zA-Z
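To see how this pattern behaves on concrete names, including the digit-only and underscore cases raised in the comments, here is a small Python sketch. The function name `matches` and the sample names are illustrative only.

```python
import re

# Pattern copied verbatim from this answer.
PATTERN = re.compile(
    r"(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)"
)


def matches(name: str) -> bool:
    """Return True if the whole name matches the pattern."""
    return PATTERN.fullmatch(name) is not None


for name in [
    "groupa-zone1appserver.example.com",
    "cod.eu",
    "1a.ko.uk",
    "1234.co.uk",   # digit-only first label
    "abc.123.uk",   # digit-only middle label
    "911.com",      # a real-world style digit-only name
    "my_host.com",  # underscore is in the character class
]:
    print(name, "->", "match" if matches(name) else "no match")
```

Note that the digit-only rejection comes entirely from the `(?!\d+\.)` lookahead, so removing it changes the verdict for names such as 911.com.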

Anton Nikiforov
tombolinux
  • 1
    Would you like to explain the notation? What does it do with `ac.uk`? That's not a valid FQDN; it is a mid-level domain under the country-code TLD. – Jonathan Leffler Aug 04 '12 at 20:46
  • aa.com for example is an fqdn this regex matches only strings that are subdivided by dots and the last string is minimum 2 char. – tombolinux Aug 04 '12 at 21:07
  • With a regex you can only match a syntax, not a real dns fqdn. – tombolinux Aug 04 '12 at 21:20
  • 2
    The `?:(?!\d+\.)` should not be in there, as digit-only domains are still valid, like 911.com – Unixmonkey Jul 28 '13 at 00:39
  • 1
    @Unixmonkey - you are right, there are plenty of valid digit only subdomains. – bkr Nov 25 '13 at 21:03
  • Underscore is not allowed in host names, but it is in DNS labels. So you can have `_spf.example.com` but it cannot be the name of a host. Inside private networks, this is not enforced, but public DNS doesn't work with underscores in host names. – tripleee Mar 10 '15 at 14:21
  • There are also constraints on leading and consecutive dashes. They used to be prohibited, but now are allowed inside Punycode, according to specific rules. – tripleee Mar 10 '15 at 14:22
  • 888.com doesn't match this regex – Nati Jul 19 '17 at 13:33
4

We use this regex to validate domains which occur in the wild. It covers all practical use cases I know of. New ones are welcome. According to our guidelines it avoids capturing groups and greedy matching.

^(?!.*?_.*?)(?!(?:[\w]+?\.)?\-[\w\.\-]*?)(?![\w]+?\-\.(?:[\w\.\-]+?))(?=[\w])(?=[\w\.\-]*?\.+[\w\.\-]*?)(?![\w\.\-]{254})(?!(?:\.?[\w\-\.]*?[\w\-]{64,}\.)+?)[\w\.\-]+?(?<![\w\-\.]*?\.[\d]+?)(?<=[\w\-]{2,})(?<![\w\-]{25})$

Proof and explanation: https://regex101.com/r/FLA9Bv/40

There are two approaches to choose from when validating domains.

By-the-books FQDN matching (theoretical definition, rarely encountered in practice):

Practical / conservative FQDN matching (practical definition, expected and supported in practice):

  • by-the-books matching with the following exceptions/additions
  • valid characters: [a-zA-Z0-9.-]
  • labels cannot start or end with hyphens (as per RFC-952 and RFC-1123/2.1)
  • TLD min length is 2 characters, max length is 24 characters as per currently existing records
  • don't match trailing dot

The regex above contains both by-the-books and practical rules.
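For engines that cannot run the lookbehinds in the regex above, the practical rules can also be checked without any regex. The following Python sketch is one possible reading of those rules, not a port of the regex. The 253 and 63 character limits, the requirement of at least one dot, and the rejection of all-numeric TLDs are assumptions taken from elsewhere in this thread, and `is_practical_fqdn` is a hypothetical name.

```python
import string

# Valid characters inside a label: [a-zA-Z0-9-]
ALLOWED = set(string.ascii_letters + string.digits + "-")


def is_practical_fqdn(name: str) -> bool:
    """Check a name against the practical / conservative rules listed above."""
    if not 1 <= len(name) <= 253:  # assumed overall length limit
        return False
    labels = name.split(".")
    if len(labels) < 2:  # assumption: at least one dot is required
        return False
    for label in labels:
        # An empty label also covers the "no trailing dot" rule.
        if not 1 <= len(label) <= 63:
            return False
        if not set(label) <= ALLOWED:
            return False
        if label[0] == "-" or label[-1] == "-":
            return False
    tld = labels[-1]
    if not 2 <= len(tld) <= 24:
        return False
    if tld.isdigit():  # assumption: TLDs cannot be all-numeric
        return False
    return True


for name in ["example.com", "this_is_a_host.example.com", "example.com."]:
    print(name, "->", is_practical_fqdn(name))
```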

Community
thisismydesign
  • Note that "any characters are allowed" applies to labels in DNS in general. There are restrictions on what a valid host name is (RFC1123). Yes, in principle it is possible to create a PTR that maps an IP address to a piece of binary x86 code, but I would hesitate to let anyone fill that in over an API or in a form field, so I would apply RFC1123 restrictions. – Steven Mar 18 '20 at 15:21
  • 1
    Every `\w\d` or `\d\w` should be replaced with only `\w`, which is a proper superset of `\d`. – AndrewF Jun 11 '20 at 23:41
  • @Steven The regex is aimed at the practical use cases. Can you show examples to be excluded or included? – thisismydesign Jun 14 '20 at 10:58
  • @thisismydesign, if you allow any character, you could have `"; DROP *` or some such fun stuff as DNS labels or values. Assuming you are only dealing with host/domain names, RFC1123 restricts the allowed character set. Note that that also means `_` is not allowed. So `this-is-a-host.example.com` is fine, while `this_is_a_host.example.com` is not; neither is `-this-is-a-host-.example.com`. – Steven Dec 18 '20 at 14:50
  • 1
    @Steven Neither of those characters (`_ * ;`) are allowed by the regex. As mentioned, it contains practical rules as well. I suggest you try it out and if you find something that should or shouldn't be allowed let's discuss it. – thisismydesign Dec 19 '20 at 19:31
  • Apologies. I was thrown by the "any characters are allowed" phrase. Having said that, I cannot run this on Perl 5.30 (`Lookbehind longer than 255 not implemented in regex m/^`), but I do notice the use of `\w` class which includes the underscore... – Steven Dec 21 '20 at 10:15
  • 1
    @Steven You can try it via the regex101 link. And underscores are not allowed. The list of valid characters is in the answer. The regex is a bit more complex than to comprehend at first sight, so as I mentioned already, you should try it first. – thisismydesign Dec 22 '20 at 07:48
  • Ack. I just had a look. Missed the exclusion of `_` at the beginning. Clever. ;-) – Steven Dec 22 '20 at 10:47
3

CONSIDERATION #1:

Please note that due to relaxed requirements in RFC-2181 DNS labels can consist of pretty much any combination of symbols (however, the length restrictions are still there):

"Any binary string whatever can be used as the label of any resource record. Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs." (https://www.rfc-editor.org/rfc/rfc2181#section-11)

CONSIDERATION #2:

"There is an additional rule that essentially requires that top-level domain names not be all-numeric" (https://www.rfc-editor.org/rfc/rfc3696#section-2)

Taking into account these two considerations, the correct regex looks like this:

/^(?!:\/\/)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$/i

See demo @ http://regexr.com/3g5j0
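To try this pattern outside a JavaScript-style regex literal, here is a hedged Python port. Assumptions: the `/i` flag becomes `re.IGNORECASE`, and `\/` is written as a plain `/`, since the slash needs no escaping in Python. The function name `looks_like_fqdn` is invented for the example.

```python
import re

# Python port of the pattern above; the surrounding /.../i becomes a flag.
PATTERN = re.compile(
    r"^(?!://)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$",
    re.IGNORECASE,
)


def looks_like_fqdn(name: str) -> bool:
    """Return True if the whole name matches the ported pattern."""
    return PATTERN.fullmatch(name) is not None


for name in [
    "example.com",
    "example.com.",         # trailing dot is accepted
    "my_host.example.com",  # any character is allowed below the TLD
    "example.123",          # all-numeric TLD
    "localhost",            # no dot at all
]:
    print(name, "->", looks_like_fqdn(name))
```

Because the non-TLD part is matched with `.`, this pattern is very permissive below the TLD, which follows from Consideration #1.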

Community
Anton Nikiforov
0

The following expression

(^((?=^.{4,253}$)(((http){0,1}|(http){0,1}|(ftp){0,1}|(ws){0,1})(s{0,1}):\/\/){0,1})((((?!-)[\pL0-9\-]{1,63})(?<!-)(\.)){1,})(((?!-)[a-z0-9\-]{1,63})(?<!-)((\/{0,1}[\pL\pN?=\-]*)+){1})$)

will match

https://www.tes1t.com/lets/to?878932572
https://www.test.co.uk/lets/to?878932572
http://www.test.com/lets/to?878932572
http://www.test.co.uk/lets/to?878932572
ftp://www.test.com/lets/to?878932572
subdomain.test.com/lets/to?878932572
subdomain.subdomain.test.net/lets/to?878932572

sub-domain.test.net/lets/to?878932572
sub-domain.test.net/lets-go/to?878932572
www.test.net/lets/to?878932572
www.test-test.com/
www.test-test.com

subdomain.subdomainsubdomainsuèdomainsubdomainsubdomainsubdomainsubdomain.net/let2s/to?=878932572

www.test-test.co.uk
http://www.test-test-.com/test
www.test-teèst.co.uk/lets
www.test-test.co.uk/lets/
www.test-test.co.uk/lets/to?
test-test.co.uk/lets/to?
test-test.co.uk/lets/
test-test.co.uk/lets
test-test.co.uk
http://test.com/lets/to?878932572
https://test.com/lets/to?878932572
ftp://test.com/lets/to?878932572
ftps://test.com/lets/to?878932572
ws://test.com/lets/to?878932572aa
wss://test.com/lets/to?=878932572bar
test.com

subdomain.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.khbdomainsubdomainsubdomain.test.net/lets/to?87893257

but will not match:

www.-test-fail-.com
www.-test-fail.com
-test-fail.com
test-fail-.com

subdomain.subdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainubdomainsubdomainsubdomain.test.net/lets/to?878932572

subdomain.subdomainsubdomainsubdcnvcnvcnofhfhghgfhvnhj-mainsubdomainsubdohhghghghfhgffgjh-gfhfdhfdghmainsubdocgvhngvnbnbmghghghaihgfjgfnfhfdghgsufghgghghhdfjgffsgfbdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomain.test.net/lets/to?878932572

subdomain.test.test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test.khbdomainsubdomainsubdomain.test.net/lets/to?87893257
MrsPop88