If a string contains an invalid UTF-8 byte sequence anywhere, preg_match with the u modifier returns false to signal an error. For example:

<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u',$string, $match);
var_dump($r);  //bool(false)

You can try this example yourself: https://3v4l.org/qkHl4
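To confirm that the false comes from the UTF-8 validity check and not from the pattern itself, the error code can be inspected. This is a minimal sketch that assumes only the standard preg_last_error() function and the PREG_BAD_UTF8_ERROR constant:

```php
<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u', $string, $match);

// preg_match() returns false on error; preg_last_error() says which one.
if ($r === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "subject is not valid UTF-8\n";
}
```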

If the invalid byte at the end is removed, the regular expression finds the first character.

$string = "ABCD";
$r = preg_match('/^./u',$string, $match);
var_dump($r, $match); 
//int(1) array(1) { [0]=> string(1) "A" }

Is there an easy way to use regular expressions to identify a UTF-8 character at the beginning of a string that also contains invalid UTF-8 bytes?

jspit
  • I had also expected the character "A" to be found in the first example. I suspect preg_match first checks whether the whole string is valid UTF-8 before applying the regular expression. – jspit Nov 05 '19 at 09:36
  • According to this [answer](https://stackoverflow.com/a/8215387/3278362) you can remove invalid utf characters using [mb_convert_encoding](https://www.php.net/manual/en/function.mb-convert-encoding.php) – marcell Nov 05 '19 at 09:42
  • See this: https://3v4l.org/iYiah – anubhava Nov 05 '19 at 09:43
  • If your string is not valid UTF-8, why use `u` mode and expect it to work…? – deceze Nov 05 '19 at 09:51
  • "\u{c3}" is the UTF-8 character "Ã", whereas "\xc3" is not a UTF-8 character. I have strings which also contain non-UTF-8 bytes. – jspit Nov 05 '19 at 09:53
  • If your string contains "non-UTF-8 characters", then it's not a valid UTF-8/Unicode string, and `u` expects and only works on UTF-8 strings. You either have a valid UTF-8 string and can then use Unicode regexen on it, or you don't and you can't. – deceze Nov 05 '19 at 09:58
  • mb_substr() with encoding UTF-8 can handle invalid UTF-8 strings. This behavior is unfortunately not documented. That's why I'm looking for another solution. – jspit Nov 05 '19 at 10:33

3 Answers

Based on [this answer](https://stackoverflow.com/a/8215387/3278362), you can replace invalid UTF-8 bytes using mb_convert_encoding:

$string = "ABCD\xc3";
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$r = preg_match('/^./u', $string, $match);
var_dump($r, $match);

This gives the following result:

int(1)
array(1) {
  [0] =>
  string(1) "A"
}
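Because the conversion rewrites the string, it can help to record whether the original was valid before converting. The following is only a sketch, assuming the mbstring extension is loaded; mb_check_encoding is not part of the answer above:

```php
<?php
$string = "ABCD\xc3";

// Check validity before converting; the converted copy no longer
// reveals where (or whether) invalid bytes were present.
$wasValid = mb_check_encoding($string, 'UTF-8');
$clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

var_dump($wasValid);                           // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8'));  // bool(true)
```

With this, the u modifier can be applied to $clean while $wasValid still records that the input was malformed.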
marcell
  • Unfortunately, this does not help me. The line with mb_convert_encoding replaces all invalid bytes with "?", so I cannot tell whether the first character is valid UTF-8 or not. Note: for an escape sequence in hexadecimal notation such as "\xc3" to be parsed, the string must be enclosed in double quotes ("). – jspit Nov 05 '19 at 13:18

You could also consider using T-Regx, which handles UTF8 errors in a more cooperative way:

try {
    pattern('^.', 'u')->match("ABCD\xc3")->all();
} catch (SafeRegexException $e) {
    // handle
}
Danon

After a long search, I think I found an answer myself.

The u modifier works only if the entire string is valid UTF-8. Even if only the first character is to be matched, the entire string is checked first. The u modifier therefore cannot be used for this problem. However, regular expressions without it can be used.

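// Returns the first UTF-8 character of $string, or false if the string
// is empty or starts with bytes that do not form a UTF-8 sequence.
// No u modifier: the pattern works on raw bytes. It only checks the
// lead byte and the number of continuation bytes, so overlong forms
// and surrogates are not rejected.
function utf8Char($string){
    $ok = preg_match(
      '/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
      |^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
      |^[\xC0-\xDF][\x80-\xBF]
      |^[\x00-\x7f]/sx',
      $string,
      $match);
    return $ok ? $match[0] : false;
}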


var_dump(utf8Char("€a\xc3def"));  //string(3) "€"
var_dump(utf8Char("a\xc3def"));  //string(1) "a"
var_dump(utf8Char("\xc3def"));  //bool(false)
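The byte ranges above accept some sequences that strict UTF-8 forbids, such as overlong encodings ("\xC0\x80"), surrogates, and code points above U+10FFFF. If that matters, the ranges can be tightened. The following is a sketch of a stricter variant, not part of the original answer; the function name utf8CharStrict is hypothetical, and the ranges follow the well-formed byte sequences of RFC 3629:

```php
<?php
// Hypothetical stricter variant of utf8Char(); the name is an assumption.
function utf8CharStrict(string $string)
{
    $ok = preg_match(
        '/^(?:
            [\x00-\x7F]                        # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong C0 or C1
          | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlong forms
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
          | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlong forms
          | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
          | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
        )/x',
        $string,
        $match
    );
    // First well-formed character, or false if there is none.
    return $ok ? $match[0] : false;
}

var_dump(utf8CharStrict("a\xc3def"));   // string(1) "a"
var_dump(utf8CharStrict("\xC0\x80"));   // bool(false), overlong encoding
```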

The invalid bytes can be retrieved with the byte-oriented substr function.

var_dump(substr("\xc3def",0,1)); //string(1) "�"
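Putting the two pieces together, a string can be walked from left to right, taking a whole character where the pattern matches and a single byte where it does not. This is a rough sketch under that assumption; the helper name splitUtf8 and its return shape are hypothetical, and the pattern is the same one used in utf8Char() above:

```php
<?php
// Hypothetical helper: splits a byte string into UTF-8 characters and
// single invalid bytes. Name and return shape are assumptions.
function splitUtf8(string $string): array
{
    // Same lead-byte and continuation-byte ranges as utf8Char().
    $pattern = '/^[\xF0-\xF7][\x80-\xBF]{3}
                |^[\xE0-\xEF][\x80-\xBF]{2}
                |^[\xC0-\xDF][\x80-\xBF]
                |^[\x00-\x7F]/sx';
    $parts = [];
    $length = strlen($string);
    $offset = 0;
    while ($offset < $length) {
        $rest = substr($string, $offset);
        if (preg_match($pattern, $rest, $match)) {
            $bytes = $match[0];           // one complete character
            $valid = true;
        } else {
            $bytes = substr($rest, 0, 1); // one invalid byte
            $valid = false;
        }
        $parts[] = ['valid' => $valid, 'bytes' => $bytes];
        $offset += strlen($bytes);
    }
    return $parts;
}

$parts = splitUtf8("a\xc3d");
var_dump(count($parts));        // int(3)
var_dump($parts[1]['valid']);   // bool(false)
```

Calling substr on the remainder in every iteration is quadratic for long strings; it is kept here for readability.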
jspit