I need to detect when a string contains 4-byte characters in PHP. Is there a built-in function or regex that can do this efficiently?

I found this article about replacing such characters, but I cannot find a working example that only detects them.

Can php detect 4-byte encoded utf8 chars?

This is about as far as I got but it fails too:

$chars = str_split($term);
foreach ($chars as $char) {
    if (strlen($char) >= 4) {
        print "Found 4-byte character\n";
    }
}
Dharman
Wonko the Sane

2 Answers

You can use a regex to match every character outside the Basic Multilingual Plane (BMP), that is, every code point above U+FFFF. These are exactly the characters that UTF-8 encodes in 4 bytes.

// U+1F600 stands in for any 4-byte character
$str = "€\u{1F600}A\u{1F600}¢";

$r = preg_match_all('|[\x{10000}-\x{10FFFF}]|u', $str, $matches);

var_dump($matches[0]);

Try it here: https://3v4l.org/JX9aQ
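
If you only need a yes/no answer rather than the list of matches, the same character class can be wrapped in a small helper. This is a minimal sketch: the function name `hasFourByteChars` is hypothetical, and it assumes the input is valid UTF-8 (on invalid input `preg_match()` with the `u` modifier returns `false`, which this helper reports as "no match").

```php
<?php
// Hypothetical helper; the name is illustrative only.
function hasFourByteChars(string $s): bool
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(hasFourByteChars("caf\u{E9} \u{20AC}")); // only 2- and 3-byte characters
var_dump(hasFourByteChars("smile \u{1F600}"));    // U+1F600 needs 4 bytes in UTF-8
```

Unlike `preg_match_all()`, `preg_match()` stops at the first hit, which is all a detection check needs.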

If you are using PHP 7.4 or later, you can also do this with mb_str_split() and array_filter(). I don't think it will be more efficient than the regex, but it is good to know.

$nonBMP = array_filter(mb_str_split($str), fn($c) => strlen($c) === 4);
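
For comparison, the attempt in the question fails because `str_split()` splits a string into single bytes, so `strlen()` of each piece is always 1. The sketch below contrasts the two functions on an illustrative string (the sample value is an assumption, not taken from the question) and requires PHP 7.4+ for `mb_str_split()`.

```php
<?php
// Illustrative input: "A" (1 byte) followed by U+1F600 (4 bytes in UTF-8).
$s = "A\u{1F600}";

// str_split() works on bytes: five 1-byte pieces.
var_dump(count(str_split($s)));

// mb_str_split() works on characters: two pieces, of 1 and 4 bytes.
$chars = mb_str_split($s, 1, 'UTF-8');
var_dump(count($chars));
var_dump(array_map('strlen', $chars));
```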
Dharman

If you are working with UTF-8 characters, you need the multibyte string functions to handle the string character by character. These functions let you display the number of bytes for each character in a string, similar to your code:

// U+1F600 stands in for any 4-byte character
$string = "€\u{1F600}A\u{1F600}¢";
for($i=0; $i < mb_strlen($string); $i++){
  $mbChar = mb_substr($string,$i,1);
  echo $mbChar." (".strlen($mbChar)." Byte)<br>\n";
}

Output:

€ (3 Byte)
😀 (4 Byte)
A (1 Byte)
😀 (4 Byte)
¢ (2 Byte)

This answer is mainly for understanding. To find a 4-byte UTF-8 character, the regular expression shown by @Dharman is shorter and faster.
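
The byte counts above also suggest a byte-level check. In valid UTF-8, the bytes 0xF0 to 0xF4 occur only as the first byte of a 4-byte sequence, so searching for one of them is enough. This is a sketch under that assumption: the function name `hasFourByteLeadByte` is hypothetical, and the result is only meaningful if the input is valid UTF-8 (which `mb_check_encoding($s, 'UTF-8')` can confirm beforehand).

```php
<?php
// Hypothetical helper; assumes $s is valid UTF-8.
function hasFourByteLeadByte(string $s): bool
{
    // No u modifier here, so the pattern is matched byte by byte.
    return preg_match('/[\xF0-\xF4]/', $s) === 1;
}

var_dump(hasFourByteLeadByte("\u{20AC}A\u{A2}")); // 3-, 1- and 2-byte characters only
var_dump(hasFourByteLeadByte("A\u{1F600}"));      // contains a 4-byte character
```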

jspit