how to validate internationalized domain names

Question

I want to validate a domain URL in PHP that may be an internationalized domain name, for example the Greek name http://παράδειγμα.δοκιμή. Is there any way to validate it using a regular expression?

"Validate" as in "check if it's acceptable for DNS" (failures would be fairly rare) or as in "check if it actually exists in DNS" (failures would be common, given random input). — tripleee, Jan 14 '13 at 06:08
What is valid? Is it just `http://` followed by some characters, then a `.` followed by some characters? — Naveed S, Jan 14 '13 at 06:22
I just want to check whether the domain name is valid or not. Is there any regex which can help me out here? The URL may have characters from other languages like German, e.g. yÄhoo.com. I am using this regex, but it only works for alphanumeric characters: /^[a-z\d][a-z\d-]{0,62}$/i. How can I write a regex which also accepts characters from other languages? — user1969981, Jan 15 '13 at 04:20

score 3 · Answer 1 · edited Oct 07 '21 at 05:57

This is a so-called IDN (internationalized domain name). Clients that support IDNs normalize the name according to the IDNA2008 standard (RFC 5890), then encode the remaining Unicode characters with Punycode (RFC 3492) before submitting the name for DNS resolution.

By specification, literally every character in the UTF-8 character set is valid to use in a IDN domain, but every top level domain authority can define valid characters within the Unicode charset so it will be hard to create and maintain a real regex.

If you want to accept IDN domains in your application, you should work with the encoded version internally. The PHP intl extension provides two functions to encode and decode IDN domain names, idn_to_ascii and idn_to_utf8:

echo idn_to_ascii('täst.de');

xn--tst-qla.de

After encoding, the domain will pass any traditional regex check.

Simple validation:

$url = "http://example.com/";
if (preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)) {
    echo 'OK';
} else {
    echo 'Invalid URL.';
}
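
Combining the two steps above, a minimal sketch might convert the host name to its ASCII form first and then check each label against the classic letter-digit-hyphen rules. The function names idn_host_to_ascii and is_valid_ascii_hostname are illustrative and not part of PHP. The limits used (63 characters per label, 253 for the whole name) are the usual DNS limits, and non-ASCII input is assumed to need the intl extension.

```php
<?php
// Hypothetical helper: returns the ASCII (Punycode) form of a host name,
// or false if it cannot be converted.
function idn_host_to_ascii(string $host)
{
    // Pure ASCII input needs no conversion.
    if (preg_match('/^[\x00-\x7F]*\z/', $host)) {
        return strtolower($host);
    }
    // Non-ASCII input needs the intl extension.
    if (!function_exists('idn_to_ascii')) {
        return false;
    }
    return idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
}

// Hypothetical helper: letter-digit-hyphen check on an already encoded name.
function is_valid_ascii_hostname(string $host): bool
{
    if ($host === '' || strlen($host) > 253) {
        return false;
    }
    foreach (explode('.', $host) as $label) {
        // 1 to 63 characters, no leading or trailing hyphen.
        if (!preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i', $label)) {
            return false;
        }
    }
    return true;
}

$ascii = idn_host_to_ascii('xn--tst-qla.de');
echo ($ascii !== false && is_valid_ascii_hostname($ascii)) ? 'OK' : 'Invalid', "\n";
```

This sketch takes a bare host name, not a full URL, and it rejects a trailing dot; both are simplifications.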

EDIT:

If you want a real DNS verification, you can use dns_get_record (PHP 5) or gethostbyname (gethostbyaddr goes the other way, from an IP address to a host name).

e.g.

$domain = 'ελληνικά.idn.icann.org';
$idnDomain = idn_to_ascii( $domain );

if ( $dnsResult = dns_get_record( $idnDomain, DNS_ANY ) )
{
    echo $idnDomain , "\n";
    print_r( $dnsResult );
}
else
{
    echo "failed to lookup domain\n";
}

Result:

xn--hxargifdar.idn.icann.org
Array 
(
    [0] => Array
    (
        [host] => xn--hxargifdar.idn.icann.org
        [class] => IN
        [ttl] => 21456
        [type] => A
        [ip] => 199.7.85.10
    )
    [1] => Array
    (
        [host] => xn--hxargifdar.idn.icann.org
        [class] => IN
        [ttl] => 21600
        [type] => AAAA
        [ipv6] => 2620::2830:230:0:0:0:10
    )
)
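
When only a yes/no answer is needed, checkdnsrr avoids fetching and printing full records. The sketch below wraps it. idn_domain_has_dns is an illustrative name, and the $lookup parameter is an assumption added only so that the function can be exercised without network access. Non-ASCII input again assumes that the intl extension is loaded.

```php
<?php
// Hypothetical helper: true if the name has an A or AAAA record.
// $lookup defaults to PHP's checkdnsrr(); it is injectable for testing.
function idn_domain_has_dns(string $domain, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'checkdnsrr';

    // Convert only when needed, so plain ASCII names work without intl.
    if (preg_match('/[^\x00-\x7F]/', $domain)) {
        if (!function_exists('idn_to_ascii')) {
            return false;
        }
        $domain = idn_to_ascii($domain);
        if ($domain === false) {
            return false;
        }
    }

    // Ask for the two address record types one after the other.
    foreach (['A', 'AAAA'] as $type) {
        if ($lookup($domain, $type)) {
            return true;
        }
    }
    return false;
}
```

Calling idn_domain_has_dns('ελληνικά.idn.icann.org') performs a live lookup, so the result depends on the resolver and on the current state of DNS.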

I *think* I found an *important* error in your answer. You say: `By specification, literally every character in the UTF-8 character set is valid to use in a IDN domain` (whilst you talk about IDNA2008 and RFC5890). *HOWEVER* (in my understanding), IDNA2008 now `disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation` (previously allowed in IDNA2003 and at the moment still work in most implementations). See http://www.unicode.org/faq/idn.html & http://tools.ietf.org/html/rfc5892 .Or did I misread it? — GitaarLAB, Jun 21 '13 at 12:33
@Gitaar thanks, yes you're right. This is new to me but absolutely makes sense, because domain names are case insensitive, and punctuation characters might be reserved (e.g. `dot` domain delimiter, `?` query string delimiter etc.). — Michel Feldheim, Jul 12 '13 at 18:26

score 3 · Answer 2 · edited Mar 30 '21 at 08:14

If you want to create your own library, you need to use the table of permitted codepoints (IANA — Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the table of Unicode Script properties (UNIDATA/Scripts.txt).

Gmail adopts the Unicode Consortium’s “Highly Restricted” specification (Protecting Gmail in a global world). The following combinations of Unicode Scripts are permitted.

Single script
Latin + Han + Hiragana + Katakana
Latin + Han + Bopomofo
Latin + Han + Hangul

You may need to pay attention to the special script property values (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.

For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties ("Katakana" and "Hiragana"), and the PCRE functions classify it as "Inherited". Another example is U+2A708. Although the right script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".
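
The script combinations listed above can be approximated with PCRE character classes. The following is only a sketch: is_allowed_script_mix is an illustrative name, the default list of single scripts is an arbitrary assumption that you would extend for your own audience, and characters with the Common or Inherited property are simply allowed everywhere. It looks at scripts only and does not consult the table of permitted code points.

```php
<?php
// Hypothetical helper: checks a label against the script combinations
// listed above. Common and Inherited characters are allowed everywhere.
function is_allowed_script_mix(
    string $label,
    array $singleScripts = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Hangul', 'Arabic', 'Hebrew', 'Thai']
): bool {
    $shared = '\p{Common}\p{Inherited}';

    // The three permitted multi-script combinations.
    $combos = [
        '\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}',
        '\p{Latin}\p{Han}\p{Bopomofo}',
        '\p{Latin}\p{Han}\p{Hangul}',
    ];
    foreach ($combos as $combo) {
        if (preg_match('/^[' . $combo . $shared . ']+\z/u', $label) === 1) {
            return true;
        }
    }

    // Single script: every character belongs to one script (or is shared).
    foreach ($singleScripts as $script) {
        if (preg_match('/^[\p{' . $script . '}' . $shared . ']+\z/u', $label) === 1) {
            return true;
        }
    }

    // Mixed scripts, an unlisted script, an empty label, or invalid UTF-8.
    return false;
}
```

A label that mixes Latin letters with a Cyrillic look-alike fails both loops, which is the kind of input this policy is meant to catch.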

You may also need to consider IDN homograph attacks. Google Chrome's IDN policy uses a blacklist of characters.

My recommendation is to use Zend\Validator\Hostname. This library uses the table of permitted code points for Japanese and Chinese.

If you use Symfony, consider upgrading the app to version 2.5, which adopts egulias/email-validator (Manual). You need an extra check that the string is a well-formed byte sequence. See my report for the details.

Don't forget XSS and SQL injection. The following address is a valid email address according to RFC 5322.

// From Japanese tutorial
// http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
"><script>alert('or/**/1=1#')</script>"@example.jp

I think using idn_to_ascii for validation is doubtful, since idn_to_ascii passes almost all characters.

for ($i = 0; $i < 0x110000; ++$i) {
    $c = utf8_chr($i);

    if ($c !== '' && false !== idn_to_ascii($c)) {
        $number = strtoupper(dechex($i));
        $length = strlen($number);

        if ($i < 0x10000) {
            $number = str_repeat('0', 4 - $length).$number;
        }
    
        $idn = $c.'example.com';

        echo 'U+'.$number.' ';
        echo ' '.$idn.' '. idn_to_ascii($idn);
        echo PHP_EOL;
    }
}

function utf8_chr($code_point) {

    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }

    if ($code_point < 0x80) {
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        $hex[0] = 0xC0 | $code_point >> 6;
        $hex[1] = 0x80  | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        $hex[0] = 0xE0 | $code_point >> 12;
        $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[2] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else  {
        $hex[0] = 0xF0 | $code_point >> 18;
        $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
        $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[3] = 0x80 | $code_point  & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }

    return $ret;
}
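
The loop above calls idn_to_ascii with its default flags. As a sketch of a somewhat stricter conversion, the intl extension also accepts UTS #46 option flags and reports errors through a fourth argument. strict_idn_to_ascii is an illustrative name, and returning null when intl is missing is a choice made for this example only. This is still the compatibility processing of UTS #46 (it maps characters such as uppercase letters instead of rejecting them), so it should not be read as a complete IDNA2008 validator.

```php
<?php
// Hypothetical helper: converts with stricter UTS #46 checks enabled.
// Returns the ASCII name, false on a conversion error, or null when the
// intl extension is not available.
function strict_idn_to_ascii(string $domain)
{
    if (!function_exists('idn_to_ascii')) {
        return null;
    }

    $flags = IDNA_USE_STD3_RULES            // reject ASCII other than letters, digits, hyphen
           | IDNA_CHECK_BIDI                // enforce the bidirectional text rules
           | IDNA_CHECK_CONTEXTJ            // check joiner characters in context
           | IDNA_NONTRANSITIONAL_TO_ASCII; // use nontransitional processing

    $info = [];
    $ascii = idn_to_ascii($domain, $flags, INTL_IDNA_VARIANT_UTS46, $info);

    // $info['errors'] is a bit set of IDNA_ERROR_* constants; 0 means no error.
    if ($ascii === false || !empty($info['errors'])) {
        return false;
    }
    return $ascii;
}
```

Replacing the plain idn_to_ascii($c) call in the loop with this helper should shrink the list of accepted code points, although how far depends on the ICU version behind intl.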

If you want to validate a domain by Unicode Script properties, use the PCRE functions.

The following code shows how to get the name of a character's Unicode Script property. If you want to get the Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.

function get_unicode_script_name($c) {

  // http://php.net/manual/regexp.reference.unicode.php
  $names = [
    'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali', 
    'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal',
    'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform',
    'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs',
    'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 
    'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
    'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese',
    'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin',
    'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
    'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian',
    'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian',
    'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa',
    'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian',
    'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
    'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana',
    'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi'
  ];

  foreach ($names as $name) {

    // Test the character against one script class at a time.
    $pattern = '/\p{'.$name.'}/u';

    if (preg_match($pattern, $c)) {
      return $name;
    }
  }

  return '';
}

GreenRover · Answer 3 · 2013-01-15T07:53:00.890

These are IDN domains. I would first convert the name to its Punycode version and then validate the domain.

But if you really want to validate with a regex:

<?php

$domain = 'παράδειγμα.gr';
$regex = '#^([\w-]+://?|www[\.])?([^\-\s\,\;\:\+\/\\\?\^\`\=\&\%\"\'\*\#\<\>]*)\.[a-z]{2,7}$#';
if (preg_match($regex, $domain)) {
    echo "VALID";
}

But this will give you false positives, because validating an IDN domain is really complex. I tried to check that no invalid characters are present, but the list is NOT complete.

Better to convert to Punycode first:

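// The TLD part also accepts encoded IDN TLDs, which start with xn-- and contain digits and hyphens.
$regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.(?:[a-z]{2,63}|xn--[a-z0-9-]{1,59})$#';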
if (preg_match($regex, idn_to_ascii($domain))) {
    echo "VALID";
}

And if you additionally want to test whether the domain can be resolved, try:

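// The TLD part also accepts encoded IDN TLDs, which start with xn-- and contain digits and hyphens.
$regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.(?:[a-z]{2,63}|xn--[a-z0-9-]{1,59})$#';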
$punny_domain = idn_to_ascii($domain);
if (preg_match($regex, $punny_domain)) {
    if (gethostbyname($punny_domain) != $punny_domain) {
        echo "VALID";
    }
}
