preg_match does not find a UTF-8 character at the beginning of a binary string which contains non-UTF-8 characters

Question

If a string contains a non-UTF-8 byte anywhere, preg_match with the modifier u returns false to signal an error. For example:

<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u',$string, $match);
var_dump($r);  //bool(false)

You can try this example yourself: https://3v4l.org/qkHl4
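The false here is an error signal rather than a "no match" result (which would be int(0)). A minimal sketch of how to confirm this with preg_last_error(), which reports PREG_BAD_UTF8_ERROR when the subject is not valid UTF-8:

```php
<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u', $string, $match);

// false means the call failed; 0 would mean "no match".
var_dump($r);                                        // bool(false)

// The error code identifies malformed UTF-8 in the subject string.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```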

The regular expression finds the first character once the non-UTF-8 byte at the end is removed.

$string = "ABCD";
$r = preg_match('/^./u',$string, $match);
var_dump($r, $match); 
//int(1) array(1) { [0]=> string(1) "A" }

Is there an easy way to use regular expressions to identify a UTF-8 character at the beginning of strings that also contain non-UTF-8 bytes?

I had also expected the character "A" to be found in the first example. I suspect preg_match checks whether the whole string is valid UTF-8 before applying the regular expression. — jspit, Nov 05 '19 at 09:36
According to this [answer](https://stackoverflow.com/a/8215387/3278362) you can remove invalid utf characters using [mb_convert_encoding](https://www.php.net/manual/en/function.mb-convert-encoding.php) — marcell, Nov 05 '19 at 09:42
If your string is not valid UTF-8, why use `u` mode and expect it to work…? — deceze, Nov 05 '19 at 09:51
This "\u{c3}" is the UTF-8 character "Ã", and this "\xc3" is not a UTF-8 character. I have strings which also contain non-UTF-8 bytes. — jspit, Nov 05 '19 at 09:53
If your string contains "non-UTF-8 characters", then it's not a valid UTF-8/Unicode string, and `u` expects and only works on UTF-8 strings. You either have a valid UTF-8 string and can then use Unicode regexen on it, or you don't and you can't. — deceze, Nov 05 '19 at 09:58
mb_substr() with encoding UTF-8 can handle invalid UTF-8 strings. This behavior is unfortunately not documented. That's why I'm looking for another solution. — jspit, Nov 05 '19 at 10:33

marcell · Answer 1 · 2019-11-05T13:32:37.720

0

Based on this answer you can remove invalid UTF-8 characters using mb_convert_encoding:

$string = "ABCD\xc3";
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$r = preg_match('/^./u', $string, $match);
var_dump($r, $match);

This gives the following result:

int(1)
array(1) {
  [0] =>
  string(1) "A"
}

edited Nov 05 '19 at 13:32

answered Nov 05 '19 at 09:48

marcell


Unfortunately, this does not help me. The line with mb_convert_encoding replaces every non-UTF-8 byte with "?", so I cannot tell whether the first character is valid UTF-8 or not. Note: for an escaped character in hexadecimal notation such as "\xc3" to be parsed, the string must be enclosed in double quotes ("). – jspit Nov 05 '19 at 13:18

score 0 · Answer 2 · answered Nov 05 '19 at 09:57

You could also consider using T-Regx, which handles UTF8 errors in a more cooperative way:

try {
    pattern('^.', 'u')->match("ABCD\xc3")->all();
} catch (SafeRegexException $e) {
    // handle the malformed UTF-8 error here
}

Danon


score 0 · Accepted Answer · answered Nov 27 '19 at 07:53

After a long search, I think I found an answer myself.

The modifier u works only if the entire string is valid UTF-8. Even if only the first character is to be found, the entire string is checked first, so the modifier u cannot be used for this problem. However, a regular expression without the modifier u can match the byte patterns of UTF-8 directly:

function utf8Char($string){
    $ok = preg_match(
      '/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
      |^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
      |^[\xC0-\xDF][\x80-\xBF]
      |^[\x00-\x7f]/sx',
      $string,
      $match);
    return $ok ? $match[0] : false;      
}


var_dump(utf8Char("€a\xc3def"));  //string(3) "€"
var_dump(utf8Char("a\xc3def"));  //string(1) "a"
var_dump(utf8Char("\xc3def"));  //bool(false)
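
Note that the pattern above is simple and also accepts some byte sequences that RFC 3629 excludes, for example the overlong form "\xC0\x80". If that matters, a stricter variant could look like the sketch below. It is not part of the original answer: the name utf8CharStrict is hypothetical, and the byte ranges are those RFC 3629 lists for well-formed UTF-8.

```php
<?php
// Hypothetical stricter variant of utf8Char(); byte ranges per RFC 3629.
// Returns the leading UTF-8 character, or false if the string does not start with one.
function utf8CharStrict(string $string)
{
    $ok = preg_match(
        '/^(?:
            [\x00-\x7F]                        # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, excludes overlong C0 and C1
          | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excludes overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
          | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excludes surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excludes overlongs
          | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
          | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
        )/x',
        $string,
        $match
    );
    return $ok ? $match[0] : false;
}

var_dump(utf8CharStrict("€a\xc3def")); // string(3) "€"
var_dump(utf8CharStrict("\xc0\x80"));  // bool(false)
```

The pattern has no u modifier, so PCRE works on raw bytes and never validates the rest of the string.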

The non-UTF-8 bytes can be retrieved with the substr function.

var_dump(substr("\xc3def",0,1)); //string(1) "�"
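
Putting both ideas together, a whole string can be walked from left to right, collecting valid characters and stray bytes separately. The following is only a sketch under my own assumptions: splitUtf8 is a hypothetical helper, and instead of a byte-range pattern it applies the u modifier to short slices of 1 to 4 bytes, so the validity check never sees the rest of the string.

```php
<?php
// Hypothetical helper: split a byte string into valid UTF-8 characters
// and single invalid bytes, preserving their order.
function splitUtf8(string $bytes): array
{
    $parts = [];
    $length = strlen($bytes);
    for ($i = 0; $i < $length; ) {
        $char = null;
        // A UTF-8 character is 1 to 4 bytes long; test each candidate slice.
        for ($size = 1; $size <= 4 && $i + $size <= $length; $size++) {
            $slice = substr($bytes, $i, $size);
            // With /u only this short slice is validated, not the whole input.
            if (preg_match('/^.\z/us', $slice) === 1) {
                $char = $slice;
                break;
            }
        }
        if ($char !== null) {
            $parts[] = ['valid' => true, 'bytes' => $char];
            $i += strlen($char);
        } else {
            // No valid character starts here: emit one raw byte and move on.
            $parts[] = ['valid' => false, 'bytes' => $bytes[$i]];
            $i++;
        }
    }
    return $parts;
}

$parts = splitUtf8("a\xc3def");
var_dump(count($parts));        // int(5)
var_dump($parts[1]['valid']);   // bool(false)
```

Trying up to four slices per position is not the fastest approach, but it keeps the validity rules inside PCRE rather than restating them by hand.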
