9

I’m working on a “proper” URI validator, and at this point it all comes down to hostname validation; the rest isn’t that tricky.

I’m stuck on IDN hostname labels (i.e., labels containing Unicode; any Punycode-encoded strings have already been decoded at this point).

My first idea was basically one regex for TLDs which don’t support IDNs and one for those which do, perhaps based on Mozilla’s list of IDN-enabled TLDs: ^[a-zA-Z0-9\-]+$ and ^[a-zA-Z0-9\-\p{L}]+$, respectively. However, this is not ideal, since every IDN registry can decide for itself which characters to allow.
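
A minimal sketch of the two-regex idea in Python, for illustration only. Python's built-in `re` module has no `\p{L}`, so the sketch approximates it with `unicodedata.category()`. The `IDN_ENABLED_TLDS` set and the function names are hypothetical placeholders, not the actual Mozilla list.

```python
import re
import unicodedata

# Hypothetical placeholder set; a real validator would load the actual
# list of IDN-enabled TLDs from whatever source it trusts.
IDN_ENABLED_TLDS = {"de", "jp"}

# Equivalent of ^[a-zA-Z0-9\-]+$ (fullmatch avoids the trailing-newline quirk of $).
ASCII_LABEL = re.compile(r"[a-zA-Z0-9\-]+")


def is_letter(ch):
    """Approximate \\p{L}: any code point whose general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


def label_is_valid(label, tld):
    """Apply the ASCII rule everywhere and the looser rule for IDN-enabled TLDs."""
    if ASCII_LABEL.fullmatch(label):
        return True
    if tld.lower() not in IDN_ENABLED_TLDS:
        return False
    # Equivalent of ^[a-zA-Z0-9\-\p{L}]+$
    return bool(label) and all(ch in "-0123456789" or is_letter(ch) for ch in label)
```

With the placeholder set above, `label_is_valid("bücher", "de")` is accepted while `label_is_valid("bücher", "com")` is rejected. The sketch still accepts any letter from any script, which is exactly the weakness described above.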

What I’m looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in the various TLDs. It’s beginning to look like I’ll have to collect all the data myself from Russian and Chinese registry sites (which is quite difficult).

So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)

TRiG
Roland Franssen

2 Answers

4

IANA maintains a list of all of the codepoints and their status at https://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties

All of the ones marked PVALID are safe to use. The ones marked CONTEXTO or CONTEXTJ have additional rules to follow. Read RFC 5892 (the IDNA code point rules) and RFC 6452 (which changes the status of a couple of characters) for all of the gory details.
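
As a rough sketch of how such a table could drive validation in Python: the `EXCERPT` list below is a tiny hand-copied subset of the derived properties, included only to make the example runnable; a real validator would load the full IANA table instead. The function names and the "UNKNOWN" fallback are assumptions made for this illustration, and the contextual rules themselves are not implemented.

```python
# Tiny illustrative excerpt of (first, last, derived property) ranges.
# This is NOT the full table; load the complete IANA data in real code.
EXCERPT = [
    (0x002D, 0x002D, "PVALID"),      # hyphen-minus
    (0x0030, 0x0039, "PVALID"),      # digits 0-9
    (0x0041, 0x005A, "DISALLOWED"),  # uppercase A-Z
    (0x0061, 0x007A, "PVALID"),      # lowercase a-z
    (0x00B7, 0x00B7, "CONTEXTO"),    # middle dot
    (0x00DF, 0x00F6, "PVALID"),      # sharp s through o with diaeresis
    (0x200C, 0x200D, "CONTEXTJ"),    # zero-width non-joiner / joiner
]


def derived_property(ch):
    """Look up the derived property of one character in the excerpt."""
    cp = ord(ch)
    for first, last, prop in EXCERPT:
        if first <= cp <= last:
            return prop
    return "UNKNOWN"  # not covered by this excerpt (hypothetical fallback)


def classify_label(label):
    """Return 'ok', 'needs-context-check', or 'reject' for an already-decoded label."""
    if not label:
        return "reject"
    props = {derived_property(ch) for ch in label}
    if props & {"DISALLOWED", "UNASSIGNED", "UNKNOWN"}:
        return "reject"
    if props & {"CONTEXTJ", "CONTEXTO"}:
        # The contextual rules from RFC 5892 would have to be applied here.
        return "needs-context-check"
    return "ok"
```

For example, `classify_label("straße")` yields `"ok"`, while `classify_label("Example")` is rejected because uppercase ASCII is DISALLOWED.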

Community
Joe Hildebrand
1

Can't you convert all Unicode domains to Punycode and validate that? Since DNS doesn't support raw UTF-8 characters anyway, this might be the best solution.
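
A minimal sketch of that idea in Python, using the standard library's `idna` codec for the conversion and a plain letter-digit-hyphen regex for the check afterwards. Note that this codec implements the older IDNA 2003 rules (RFC 3490), not the RFC 5892 tables, and the helper names here are made up for the example.

```python
import re

# Letter-digit-hyphen label: 1 to 63 characters, no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?", re.IGNORECASE)


def to_ascii_hostname(hostname):
    """Convert a Unicode hostname to its ASCII (xn--) form, or None on failure."""
    try:
        return hostname.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or overlong labels and for prohibited characters.
        return None


def is_valid_hostname(hostname):
    """Validate the ASCII form label by label."""
    ascii_name = to_ascii_hostname(hostname)
    if ascii_name is None or len(ascii_name) > 253:
        return False
    return all(LDH_LABEL.fullmatch(label) for label in ascii_name.split("."))
```

Here `to_ascii_hostname("bücher.example")` gives `"xn--bcher-kva.example"`, which then passes the ASCII check. This only shows that the name is syntactically encodable; it says nothing about whether a given registry would accept those characters.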

Byron Whitlock
  • True, I thought of that too. However, this is about user input; I can't tell my users to enter URIs that have already been converted to Punycode. So that leaves me with (what you probably meant) converting to Punycode internally. Still, that doesn't mean the hostname is actually valid (correct me if I'm wrong), so in that case matching any Unicode letter (\p{L}) and treating it as valid is basically the same thing. That last option will be my fallback if I can't come up with a good solution; if it comes to that, would you suggest sticking with the list Mozilla provides (i.e., two regexes)? – Roland Franssen May 17 '10 at 19:45
  • To clarify the above: TLDs listed by Mozilla -> [a-zA-Z0-9\-\p{L}] / all other TLDs -> [a-zA-Z0-9\-]. Would this be proper validation? – Roland Franssen May 17 '10 at 19:48
  • That depends on the encoder. Some encoders convert the input to *IDNA* and _should_ follow RFC 5892. Other encoders convert to plain *punycode* and don't have to follow RFC 5892. It's pretty easy to check: enter a Klingon DNS name, and if you receive punycode, the encoder does not follow RFC 5892 (the Klingon alphabet is in an RFC 5892 DISALLOWED code point range). – Klaws Jun 18 '20 at 17:56