
I use the code below to convert an emoji to Unicode, but how do I apply this to a string without affecting the other text in the string?

function emoji_to_unicode($emoji) {
   $emoji = mb_convert_encoding($emoji, 'UTF-32', 'UTF-8');
   $unicode = strtoupper(preg_replace("/^[0]+/","U+",bin2hex($emoji)));
   return $unicode;
}

$var = "😀";
echo emoji_to_unicode($var);

If $var is hello 😀, goodbye then the output is U+68000000650000006C0000006C0000006F000000200001F6000000002C00000020000000670000006F0000006F00000064000000620000007900000065

When it should be hello U+1F600, goodbye

user892134
  • Just as a terminology note, it doesn't make much sense to call this "converting to Unicode". In order for this to work, the *input* string must be in some form of Unicode; what you're doing is converting from that to some kind of ASCII-safe escaped-Unicode. – IMSoP Mar 25 '22 at 15:23
  • If you can tell us _why_ you want to do this, we can likely offer a better solution. Eg: if your emojis keep getting "corrupted" by mysql, then you should be using the 'utf8mb4' encoding, not the misleadingly-named 'utf8' encoding. – Sammitch Mar 25 '22 at 18:57

1 Answer


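Emojis are not 1-byte characters like 123abc@#$^. In UTF-8 most of them take 4 bytes, so a byte-level regex has to match whole 4-byte sequences rather than single bytes. (A few older emoji, such as U+263A, take only 3 bytes and are not covered here.) You can select every 4-byte character like this: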

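function to_unicode($text) {
    // Match every 4-byte UTF-8 sequence (code points U+10000 to U+10FFFF).
    // The x flag allows the spaces around | inside the pattern.
    return preg_replace_callback(
        "%(?:\xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2})%xs",
        function ($emoji) {
            // UTF-32 stores the code point in 4 bytes, e.g. 0001f600 as hex.
            $emojiStr = mb_convert_encoding($emoji[0], 'UTF-32', 'UTF-8');
            // Replace the leading zeros with "U+" and uppercase the hex digits.
            return strtoupper(preg_replace("/^[0]+/", "U+", bin2hex($emojiStr)));
        },
        $text
    );
}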

echo to_unicode( 'hello world 😀' );

The output is hello world U+1F600

How it works

First, match the 4-byte characters with a regex:

%(?:\xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2})%xs

This pattern is passed to preg_replace_callback.

The callback function then encodes each matched character:

function($emoji){
   $emojiStr = mb_convert_encoding($emoji[0], 'UTF-32', 'UTF-8');
   return strtoupper(preg_replace("/^[0]+/","U+",bin2hex($emojiStr)));
}
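
As a possible variation on the steps above, the same match can be expressed with code points instead of raw UTF-8 bytes. This is only a sketch, not the method used in this answer: it assumes PHP 7.2 or later (for mb_ord) and valid UTF-8 input, and the function name to_unicode_u is made up for illustration. With the u modifier, the range U+10000 to U+10FFFF covers the same characters as the 4-byte pattern.

```php
<?php
// Hypothetical helper: replaces every code point above U+FFFF with "U+XXXX".
function to_unicode_u(string $text): string
{
    $result = preg_replace_callback(
        // The u modifier makes PCRE treat pattern and subject as UTF-8,
        // so the class below is a code point range, not a byte range.
        '/[\x{10000}-\x{10FFFF}]/u',
        static function (array $match): string {
            // mb_ord() returns the code point as an integer (PHP 7.2+).
            return sprintf('U+%04X', mb_ord($match[0], 'UTF-8'));
        },
        $text
    );

    // preg_replace_callback() returns null on failure, for example when
    // the subject is not valid UTF-8; fall back to the unchanged input.
    return $result ?? $text;
}

echo to_unicode_u("hello \u{1F600}, goodbye"), PHP_EOL;
```

This should print hello U+1F600, goodbye. Using sprintf with mb_ord avoids the UTF-32 round trip and the leading-zero stripping, at the cost of requiring the mbstring extension and a newer PHP version.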


HOSSEIN B