3

There are new national domains and TLDs, such as "http://президент.рф/" for the Russian Federation or "http://example.新加坡" for Singapore.

Is there a regex to validate these domains?

I found this question: "What is the best regular expression to check if a string is a valid URL?"

But when I try to use one of the expressions listed there, PHP overheats :) and fails with:

preg_match(): Compilation failed: character value in \x{...} sequence is too large at offset 81

P.S.

1) The last part was solved by @OmnipotentEntity.

2) But the main problem, validating international domains, still exists, because the example regexp doesn't validate them well.

Alex Kirs

2 Answers

3

Use the "u" modifier to match Unicode characters; without it, PCRE rejects \x{...} values above 0xFF, which is the compilation error you are seeing. The example you gave only uses the "i" modifier.

OmnipotentEntity
  • But the main problem still exists. The regexp in the example accepts 'http://$$$президент!!!.рф' as a valid domain :( – Alex Kirs Dec 02 '10 at 18:58
  • I'm not going to lie, looking at that regular expression makes my brain hurt. But it looks like $ and ! are explicitly allowed in the domain name by the regular expression you're using. I double-checked RFC 3986 and RFC 3987; RFC 3986 references (via RFC 1123) RFC 952, which defines domain names. But that was pre-internationalization (written in 1985!). I don't know which characters are allowed post-internationalization, but if you don't want $ and ! to validate in the domain name, simply take them out. – OmnipotentEntity Dec 03 '10 at 17:58
  • That's why this question has no solid answer yet :) Who knows what other characters that regexp misses, and what else needs to be added to make it perfect. – Alex Kirs Dec 06 '10 at 22:36
2

No, there's no regexp to validate those domains. Each TLD has its own rules about which Unicode code points (if any) are permissible within its IDNs. You would need a very big lookup table, kept up to date, to know which specific characters are legal under each TLD.
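
What can be checked without such a table is the syntax only. A minimal Python sketch, assuming the standard library's `idna` codec is acceptable (it implements the older IDNA 2003 rules, not RFCs 5890-5895) and using a hypothetical helper name `to_ascii_hostname`: it converts the name to its ASCII form and then applies the classic letters-digits-hyphen rule to each label. It knows nothing about per-TLD character tables.

```python
import re

# One ASCII label: letters, digits, hyphens; 1-63 chars; no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$", re.IGNORECASE)


def to_ascii_hostname(hostname):
    """Return the ASCII form of hostname, or None if it fails the syntactic check.

    Hypothetical helper for illustration; not a registry-level validator.
    """
    try:
        # The stdlib codec converts each non-ASCII label to its "xn--" form.
        ascii_name = hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    if len(ascii_name) > 253:
        return None
    if not all(_LDH_LABEL.match(label) for label in ascii_name.split(".")):
        return None
    return ascii_name


print(to_ascii_hostname("президент.рф"))
print(to_ascii_hostname("$$$президент!!!.рф"))
```

The second call returns `None`, because `$` and `!` survive the conversion and then fail the letters-digits-hyphen check. A name that passes is only well-formed; whether the registry for that TLD would actually accept it is a separate question.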

Furthermore, there are rules about whether left-to-right and right-to-left characters can be combined within a single DNS label.

BTW, the RFCs mentioned in the other comments are obsolete. The recently approved set is RFCs 5890-5895.
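
To give a feel for the direction problem, here is a rough Python sketch that flags a label containing both left-to-right and right-to-left letters, using the bidirectional classes from the standard `unicodedata` module. The function name `mixes_directions` is made up for illustration, and this is only a coarse approximation, not an implementation of the actual Bidi rule in the IDNA RFCs.

```python
import unicodedata

_RTL_CLASSES = {"R", "AL"}  # right-to-left letters (e.g. Hebrew, Arabic)
_LTR_CLASSES = {"L"}        # left-to-right letters (e.g. Latin, Cyrillic)


def mixes_directions(label):
    """Return True if the label has both LTR and RTL letters (hypothetical helper)."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return bool(classes & _RTL_CLASSES) and bool(classes & _LTR_CLASSES)


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, written with escapes
print(mixes_directions("президент"))     # Cyrillic only
print(mixes_directions("abc" + hebrew))  # Latin followed by Hebrew
```

Digits and hyphens have their own bidirectional classes and are ignored here, which is one of the reasons the real rule is more involved than this check.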

Alnitak