How to convert bytes (UTF-8) to Unicode in PHP?

Question

How can I convert

\xF0\x9F\x98\x83

to

\u1F603

in PHP?

PS: It's an emoji (😃); I need the Unicode code point to use Twemoji.

UTF-8 *is* Unicode, your question doesn't make sense. Also, those values you mention there, they are escape sequences that represent the same thing in different ways. — Ulrich Eckhardt, May 09 '15 at 17:39
@UlrichEckhardt Sorry, my English is not good. Please have a look at this link: [WordPress smilies_init()](https://developer.wordpress.org/reference/functions/smilies_init/). I want to turn the values of `$wpsmiliestrans` into URLs like http://twemoji.maxcdn.com/36x36/2764.png — Cople, May 10 '15 at 07:45

score 2 · Accepted Answer · edited May 23 '17 at 12:06

Interesting, not much is out there for PHP. There seems to be a promising post, but unfortunately its accepted answer gives incorrect results in your case.

So here's a revised version of Adam's solution rewritten in PHP.

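/**
 * Translates a sequence of UTF-8 bytes to their equivalent Unicode code points.
 * Each code point is written as literal text prefixed with "\u" (e.g. \u1F603);
 * this is not PHP's "\u{...}" string escape.
 *
 * Continuation bytes are not validated, and the obsolete five- and six-byte
 * forms are accepted.
 *
 * @param string $utf8
 *
 * @return string
 */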
function utf8_to_unicode($utf8) {
    $i = 0;
    $l = strlen($utf8);

    $out = '';

    while ($i < $l) {
        if ((ord($utf8[$i]) & 0x80) === 0x00) {
            // 0xxxxxxx
            $n = ord($utf8[$i++]);
        } elseif ((ord($utf8[$i]) & 0xE0) === 0xC0) {
            // 110xxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x1F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF0) === 0xE0) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x0F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF8) === 0xF0) {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x07) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFC) === 0xF8) {
            // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x03) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFE) === 0xFC) {
            // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x01) << 30) |
                ((ord($utf8[$i++]) & 0x3F) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } else {
            throw new \Exception('Invalid UTF-8 lead byte');
        }

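        // Uppercase hex; values of up to four digits are left-padded to an even length.
        $n = strtoupper(dechex($n));
        $pad = strlen($n) <= 4 ? strlen($n) + strlen($n) % 2 : 0;
        $n = str_pad($n, $pad, "0", STR_PAD_LEFT);

        // Single quotes keep the backslash-u prefix literal in every PHP version.
        $out .= sprintf('\u%s', $n);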
    }

    return $out;
}

And in your case:

php > var_dump(utf8_to_unicode("\xF0\x9F\x98\x83"));
string(7) "\u1F603"
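
To see where `1F603` comes from, here is a minimal sketch of the four-byte branch applied by hand to the bytes from the question. The variable names `$bytes` and `$codePoint` are only illustrative and are not part of the function above.

```php
<?php
// Four-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
$bytes = "\xF0\x9F\x98\x83";

// Mask off the marker bits of each byte, then shift the payload into place.
$codePoint =
    ((ord($bytes[0]) & 0x07) << 18) |  // 0xF0 & 0x07 = 0x00 -> 0x00000
    ((ord($bytes[1]) & 0x3F) << 12) |  // 0x9F & 0x3F = 0x1F -> 0x1F000
    ((ord($bytes[2]) & 0x3F) <<  6) |  // 0x98 & 0x3F = 0x18 -> 0x00600
    ((ord($bytes[3]) & 0x3F) <<  0);   // 0x83 & 0x3F = 0x03 -> 0x00003

echo strtoupper(dechex($codePoint)), "\n"; // 1F603
```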

Please, oh _please_, call it `utf8_to_utf16`. Both are "Unicode" in the way that both are representations for Unicode code points. — DarkDust, May 10 '15 at 07:01
@DarkDust Why "utf16"? It doesn't produce UTF-16 code units. It arguably doesn't produce UTF-32 either because it performs no validation. — 一二三, May 10 '15 at 07:49
I'd suggest a few other names (none of which are nice) for this. For example, it doesn't validate continuation bytes and accepts up to six bytes for a single code point, both of which violate UTF-8. Also, the output is surely not UTF-16, because that would require at least two characters of 16 bits each to represent the char. I would say using "iconv" instead would be a better alternative. — Ulrich Eckhardt, May 10 '15 at 07:53
The more I look at this function, the more WTFs crop up. The [`\u` escape sequence produces UTF-8 sequences](https://wiki.php.net/rfc/unicode_escape). So, this function takes a UTF-8 encoded character and outputs a _string_ like `\u1234`, which in turn would evaluate to a UTF-8 sequence when used as a printf format? What's the point of this? — DarkDust, May 10 '15 at 13:27
@DarkDust It does indeed lack some documentation and the naming is off. What this function "tries" to do is to convert a UTF-8 encoded sequence of characters into their respective *literal* Unicode code points. I was not aware of `\u`'s special meaning, unfortunately. Notice how the code point *must* be wrapped in curly braces for PHP to interpret it as a code point, so the resemblance is coincidental. The name should probably be along the lines of `utf8_to_literal_unicode_code_points`. — kgilden, May 10 '15 at 14:33
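
The iconv route suggested in the comments could look roughly like the sketch below. It converts to UTF-32BE, where every code point occupies exactly four bytes, and then unpacks those as integers. The function name `utf8_to_code_points` is hypothetical, and how strictly malformed input is rejected depends on the iconv implementation PHP was built against.

```php
<?php
/**
 * Hypothetical helper (not from the answer above): decodes a UTF-8 string
 * into an array of integer code points via a UTF-32BE round trip.
 *
 * @param string $utf8
 * @return int[]
 */
function utf8_to_code_points(string $utf8): array
{
    if ($utf8 === '') {
        return [];
    }

    // The @ silences the notice iconv raises on malformed input;
    // the false return value is checked instead.
    $utf32 = @iconv('UTF-8', 'UTF-32BE', $utf8);
    if ($utf32 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // 'N*' reads consecutive unsigned 32-bit big-endian integers.
    // unpack() returns 1-based keys, so reindex from 0.
    return array_values(unpack('N*', $utf32));
}

$codePoints = utf8_to_code_points("\xF0\x9F\x98\x83");
printf("U+%04X\n", $codePoints[0]); // U+1F603
```

Returning integers rather than a formatted string leaves the choice of notation (`U+1F603`, `1f603`, and so on) to the caller.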

score 1 · Answer 2 · answered May 09 '15 at 14:21

Use a combination of:

- `stripcslashes()` to convert the `\xFF` byte escapes. That results in a string of UTF-8, because that's seemingly what it was originally.
- `json_encode()` to convert "😃" back to a `\uFFFF` Unicode escape, if that's what you want to end up with. (There is not enough context in your question to tell.)
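
As a rough illustration of those two steps, the sketch below starts from the escape sequence as literal text (single-quoted, so PHP does not interpret it) and shows what each call produces. The variable names are only illustrative. Note that `json_encode()` emits UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as a surrogate pair rather than as a single code point.

```php
<?php
// Single quotes: the string holds the 16 literal characters \xF0\x9F\x98\x83.
$escaped = '\xF0\x9F\x98\x83';

// stripcslashes() interprets the \x.. escapes, yielding the 4 raw UTF-8 bytes.
$utf8 = stripcslashes($escaped);

// By default json_encode() escapes non-ASCII characters as \uXXXX UTF-16
// code units, so U+1F603 becomes the surrogate pair D83D DE03.
$json = json_encode($utf8);

echo strlen($escaped), ' ', strlen($utf8), "\n"; // 16 4
echo $json, "\n";                                  // "\ud83d\ude03"
```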

mario


Thank you, but `json_encode(stripcslashes("\xF0\x9F\x98\x83"))` does not convert "\xF0\x9F\x98\x83" to "\u1F603"; the result is "\ud83d\ude03". On the [WordPress smilies_init()](https://developer.wordpress.org/reference/functions/smilies_init/) page you can find an array, `$wpsmiliestrans`. I need to convert its values to Unicode code points so I can create an image link like "twemoji.maxcdn.com/36x36/2764.png", where `2764` is the code point. Sorry for my bad English. – Cople May 10 '15 at 07:37
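
For the image link described in this comment, one possible sketch is shown below. The helper name `twemoji_url` is hypothetical, the base URL is taken from the example link in the question, and joining several code points with `-` is an assumption about the file naming. It relies on `mb_ord()`, which requires the mbstring extension and PHP 7.2 or later.

```php
<?php
/**
 * Hypothetical helper: builds an image URL from a UTF-8 emoji string.
 * The default base URL is the one quoted in the question; the "-" separator
 * for multi-code-point emoji is an assumption.
 */
function twemoji_url(string $emoji, string $base = 'http://twemoji.maxcdn.com/36x36/'): string
{
    // The /u modifier splits on UTF-8 characters; preg_split() returns
    // false if the input is not valid UTF-8.
    $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $chars === []) {
        throw new InvalidArgumentException('Expected a non-empty, valid UTF-8 string');
    }

    $hex = [];
    foreach ($chars as $char) {
        // mb_ord() returns the code point of the character as an integer.
        $hex[] = dechex(mb_ord($char, 'UTF-8'));
    }

    return $base . implode('-', $hex) . '.png';
}

echo twemoji_url("\xE2\x9D\xA4"), "\n";     // http://twemoji.maxcdn.com/36x36/2764.png
echo twemoji_url("\xF0\x9F\x98\x83"), "\n"; // http://twemoji.maxcdn.com/36x36/1f603.png
```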
