20

I am using utf8-charset tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding. When I insert 4-byte UTF-8 characters such as "","","","","","唧","", MySQL either raises an error or drops the text that follows the character.

How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?

Abby Chau Yu Hoi
  • Pretty simple: split a string by characters (many ways to do so) and check if `strlen($char) == 4`. Not sure if this is really the correct way to detect the characters MySQL can't handle though, going by code point may be more accurate. – deceze May 11 '13 at 11:26
  • Have you checked out the [multibyte extension](http://php.net/mbstring)? Also, be sure to always [read the comments](http://dk1.php.net/manual/en/function.mb-internal-encoding.php#66568). – Sverri M. Olsen May 11 '13 at 11:30
  • @deceze That's an approach. I will go for that if there aren't other elegant ways. – Abby Chau Yu Hoi May 11 '13 at 11:36
  • See [this related question](http://stackoverflow.com/questions/10798605/warning-raised-by-inserting-4-byte-unicode-to-mysql); I know it's Python, but you could use a regex to check for 4-byte characters. – cmbuckley May 11 '13 at 11:36
  • @cbuckley do you know whether \U is also valid in PHP? – Abby Chau Yu Hoi May 11 '13 at 11:50
  • sorry to revive such an old topic, but as far as I can tell, the characters you list aren't 4, but 3 bytes in UTF8 ;) – codeling Sep 12 '15 at 07:14
  • @codeling they require 4-byte containers anyway. :) Thanks for your information! – Abby Chau Yu Hoi Sep 15 '15 at 01:32
  • @AbbyChauYuHoi you mean they require "utf8mb4" types in MySQL, do they? I thought the "utf8" types in MySQL could store characters of up to 3 bytes? If they can't, I have to rethink my current work as well ;) – codeling Sep 15 '15 at 07:45
  • @codeling I believe the characters I listed are suitable for testing in this question. Please note that x-byte Unicode characters require x+1 bytes in UTF-8. – Abby Chau Yu Hoi Sep 17 '15 at 01:55

2 Answers

19

The following regular expression will replace 4-byte UTF-8 characters:

function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);    
}

var_dump(replace4byte('d'), replace4byte('dd'));

This doesn't rely on the /u modifier, so you don't need to worry about whether PCRE was compiled with UTF-8 support. However, if you have that support, deceze's preg_replace_callback is neater.

(Regex adapted from Ensuring valid utf-8 in PHP)
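Since the question also asks about detection, the same byte-level pattern can be reused with preg_match. This is only a minimal sketch: the helper name `contains4byte` is an assumption and not part of the answer above, and the test inputs are written as explicit byte escapes so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer):
// returns true if $string contains at least one 4-byte UTF-8 sequence.
function contains4byte(string $string): bool {
    return preg_match('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $string) === 1;
}

$fourBytes  = "\xF0\x9F\x98\x80"; // U+1F600, encoded as 4 bytes
$threeBytes = "\xE5\x94\xA7";     // U+5527, encoded as 3 bytes

var_dump(contains4byte("a{$fourBytes}b"));  // bool(true)
var_dump(contains4byte("a{$threeBytes}b")); // bool(false)
```

Like the replacement function, this works on raw bytes and does not need the /u modifier.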

cmbuckley
18

This should work:

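// True if any byte is >= 0xF0 (240), the lead byte of a 4-byte sequence.
// The empty-string guard avoids calling max() on an empty array.
if ($string !== '' && max(array_map('ord', str_split($string))) >= 240) {
    // $string contains at least one 4-byte UTF-8 character
}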

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 or higher. If there are any such bytes in the string, it's an indicator of a 4-byte sequence.
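To make the bit patterns concrete, here is a rough sketch that maps a lead byte to the length of its sequence. The function name `utf8SequenceLength` is made up for illustration, and the sketch assumes the input is well-formed UTF-8; bytes 0xF5 to 0xFF never occur in valid UTF-8 but would still be reported as 4 here, just as they would pass the `>= 240` check.

```php
<?php
// Hypothetical helper (not from the answer): given the first byte of a
// well-formed UTF-8 sequence, return how many bytes the sequence occupies.
function utf8SequenceLength(int $leadByte): int {
    if ($leadByte < 0x80) { return 1; }   // 0xxxxxxx
    if ($leadByte >= 0xF0) { return 4; }  // 11110xxx (240 and up)
    if ($leadByte >= 0xE0) { return 3; }  // 1110xxxx
    if ($leadByte >= 0xC0) { return 2; }  // 110xxxxx
    return 0;                             // 10xxxxxx: continuation byte, not a lead byte
}

// ord() returns the value of the first byte of the string.
var_dump(utf8SequenceLength(ord("\xE5\x94\xA7")));     // int(3)
var_dump(utf8SequenceLength(ord("\xF0\x9F\x98\x80"))); // int(4)
```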

If you want to remove 4-byte characters, this will do:

$string = preg_replace_callback('/./u', function (array $match) {
    // Drop any character whose UTF-8 encoding is 4 bytes long.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);

Though there may be a more elegant regex way to express high codepoints directly.
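One possible way to express the high code points directly is a code point range with the /u modifier, since U+10000 to U+10FFFF are exactly the code points that need four bytes in UTF-8. This is a sketch under the assumption that PCRE has UTF-8 support; the function name `replaceAstral` is hypothetical.

```php
<?php
// Hypothetical helper name; requires PCRE built with UTF-8 support (/u).
// Returns null if $string is not valid UTF-8, as preg_replace does with /u.
function replaceAstral(string $string, string $replacement = ''): ?string {
    // U+10000..U+10FFFF are the code points encoded as 4 bytes in UTF-8.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}

// "\xF0\x9F\x98\x80" is U+1F600 written as explicit bytes.
var_dump(replaceAstral("a\xF0\x9F\x98\x80b", 'MYTEXT')); // string(8) "aMYTEXTb"
```

Passing a non-empty replacement string substitutes placeholder text instead of deleting the character.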

deceze
  • Thanks for detection but can you finish it with a replacement example too? $a = "omg, I cannot insert into my table, blahblahblah"; //target $a == "omg, I cannot insert MYTEXT into my table, blahblahblah"; – Abby Chau Yu Hoi May 11 '13 at 11:48