I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:
preg_match('/\A(
[\x09\x0A\x0D\x20-\x7E]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/x', $string);
It's a regular expression taken from http://www.w3.org/International/questions/qa-forms-utf-8 . Everything was ok until I discovered a bug in PHP that seems to have been around at least since 2006. Preg_match() causes a seg fault if the $string is too long. There doesn't seem to be any workaround. You can view the bug submission here: http://bugs.php.net/bug.php?id=36463
Now, to avoid using preg_match I've created a function that does the exact same thing as the regular expression above. I don't know if this question is appropriate here at Stack Overflow, but I would like to know if the function I've made is correct. Here it is:
EDIT [13.01.2010]: If anyone is interested, there were several bugs in the previous version I've posted. Below is the final version of my function.
function check_UTF8_string(&$string) {
$len = mb_strlen($string, "ISO-8859-1");
$ok = 1;
for ($i = 0; $i < $len; $i++) {
$o = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
if ($o == 9 || $o == 10 || $o == 13 || ($o >= 32 && $o <= 126)) {
}
elseif ($o >= 194 && $o <= 223) {
$i++;
$o2 = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
if (!($o2 >= 128 && $o2 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 224) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 160 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif (($o >= 225 && $o <= 236) || $o == 238 || $o == 239) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 128 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 237) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$i += 2;
if (!($o2 >= 128 && $o2 <= 159) || !($o3 >= 128 && $o3 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 240) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 3;
if (!($o2 >= 144 && $o2 <= 191) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o >= 241 && $o <= 243) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 3;
if (!($o2 >= 128 && $o2 <= 191) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
elseif ($o == 244) {
$o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
$o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
$o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
$i += 5;
if (!($o2 >= 128 && $o2 <= 143) ||
!($o3 >= 128 && $o3 <= 191) ||
!($o4 >= 128 && $o4 <= 191)) {
$ok = 0;
break;
}
}
else {
$ok = 0;
break;
}
}
return $ok;
}
Yes, it's very long. I hope I've understood correctly how that regular expression works. Also hope it will be of help to others.
Thanks in advance!