2

How can I convert

\xF0\x9F\x98\x83

to

\u1F603

in PHP?

PS: It's an emoji (😃). I need the Unicode code point to use Twemoji.

Cople
  • 1
    UTF-8 *is* Unicode, your question doesn't make sense. Also, those values you mention there, they are escape sequences that represent the same thing in different ways. – Ulrich Eckhardt May 09 '15 at 17:39
  • @UlrichEckhardt Sorry, my English is not good. Please have a look at this link: [WordPress smilies_init()](https://developer.wordpress.org/reference/functions/smilies_init/). I want to turn the values of `$wpsmiliestrans` into image URLs like http://twemoji.maxcdn.com/36x36/2764.png – Cople May 10 '15 at 07:45

2 Answers

2

Interesting, not much is out there for PHP. There seems to be a promising post, but unfortunately the accepted answer gives incorrect results in your case.

So here's a revised version of Adam's solution rewritten in PHP.

/**
 * Translates a sequence of UTF-8 bytes to their equivalent Unicode code points.
 * Each code point is written as the literal text "\u" followed by its hex value.
 *
 * @param string $utf8
 *
 * @return string
 */
function utf8_to_unicode($utf8) {
    $i = 0;
    $l = strlen($utf8);

    $out = '';

    while ($i < $l) {
        if ((ord($utf8[$i]) & 0x80) === 0x00) {
            // 0xxxxxxx
            $n = ord($utf8[$i++]);
        } elseif ((ord($utf8[$i]) & 0xE0) === 0xC0) {
            // 110xxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x1F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF0) === 0xE0) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x0F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xF8) === 0xF0) {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x07) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFC) === 0xF8) {
            // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x03) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } elseif ((ord($utf8[$i]) & 0xFE) === 0xFC) {
            // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            $n =
                ((ord($utf8[$i++]) & 0x01) << 30) |
                ((ord($utf8[$i++]) & 0x3F) << 24) |
                ((ord($utf8[$i++]) & 0x3F) << 18) |
                ((ord($utf8[$i++]) & 0x3F) << 12) |
                ((ord($utf8[$i++]) & 0x3F) <<  6) |
                ((ord($utf8[$i++]) & 0x3F) <<  0)
            ;
        } else {
            throw new \Exception('Invalid UTF-8 lead byte');
        }

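        // Uppercase hex; values of up to four digits are left-padded to an even length.
        $n = strtoupper(dechex($n));
        $pad = strlen($n) <= 4 ? strlen($n) + strlen($n) % 2 : 0;
        $n = str_pad($n, $pad, '0', STR_PAD_LEFT);

        // Single quotes keep "\u" literal regardless of PHP version.
        $out .= sprintf('\u%s', $n);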
    }

    return $out;
}

And in your case

php > var_dump(utf8_to_unicode("\xF0\x9F\x98\x83"));
string(7) "\u1F603"
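
For comparison, here is a minimal sketch of the same decoding done with `iconv()` instead of manual bit masking. It assumes the iconv extension is available and that the underlying iconv implementation rejects malformed UTF-8 (glibc does; other implementations may behave differently). The helper name `utf8_code_points` is hypothetical, and it returns integers rather than a `\u`-prefixed string.

```php
<?php
/**
 * Hypothetical helper: decode a UTF-8 string into a list of code points.
 * Validation is delegated to iconv(), so no manual bit masking is needed.
 */
function utf8_code_points(string $utf8): array
{
    if ($utf8 === '') {
        return [];
    }

    // UTF-32BE stores every code point as one 4-byte big-endian integer.
    $utf32 = @iconv('UTF-8', 'UTF-32BE', $utf8);
    if ($utf32 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // 'N*' unpacks consecutive unsigned 32-bit big-endian integers.
    return array_values(unpack('N*', $utf32));
}

$codePoints = utf8_code_points("\xF0\x9F\x98\x83");
echo strtoupper(dechex($codePoints[0])), "\n"; // 1F603
```

Returning integers leaves the formatting (`\u` prefix, padding, upper or lower case) to the caller.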
kgilden
  • 3
    Please, oh _please_, call it `utf8_to_utf16`. Both are "Unicode" in the way that both are representations for Unicode code points. – DarkDust May 10 '15 at 07:01
  • @DarkDust Why "utf16"? It doesn't produce UTF-16 code units. It arguably doesn't produce UTF-32 either because it performs no validation. – 一二三 May 10 '15 at 07:49
  • I'd suggest a few other names (none of which are nice) for this. For example, it doesn't validate continuation bytes and accepts up to six bytes for a single code point, both of which violate UTF-8. Also, the output is surely not UTF-16, because that would require at least two code units of 16 bits each to represent this character. I would say using "iconv" instead would be a better alternative. – Ulrich Eckhardt May 10 '15 at 07:53
  • The more I look at this function, the more WTFs crop up. The [`\u` escape sequence produces UTF-8 sequences](https://wiki.php.net/rfc/unicode_escape). So, this function takes a UTF-8 encoded character and outputs a _string_ like `\u1234`, which in turn would evaluate to a UTF-8 sequence when used as a printf format? What's the point of this? – DarkDust May 10 '15 at 13:27
  • @DarkDust It does indeed lack some documentation and the naming is off. What this function "tries" to do is to convert a UTF-8 encoded sequence of characters into their respective *literal* Unicode code points. I was not aware of `\u`'s special meaning unfortunately. Notice how the code point *must* be wrapped in curly braces for PHP to interpret it as a code point. So the resemblance is unfortunately coincidental. The name should probably be along the lines of `utf8_to_literal_unicode_code_points`. – kgilden May 10 '15 at 14:33
1

Use a combination of:

  1. stripcslashes() to convert \xFF byte escapes.
    That'll result in a string of UTF-8, because that's what it seemingly was originally.

  2. json_encode() to convert "😃" back to a \uFFFF-style Unicode escape.
    If that's what you want to end up with. (Not enough context in your question to tell.)
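
The two steps can be sketched as follows. This is only an illustration; the variable names are made up. Note that `json_encode()` writes code points above U+FFFF as a UTF-16 surrogate pair, so the result is two `\uXXXX` escapes rather than a single `\u1F603`.

```php
<?php
// Single quotes keep the backslashes literal, as in text read from a file.
$escaped = '\xF0\x9F\x98\x83';

// Step 1: turn the \xNN escapes into the four raw UTF-8 bytes.
$utf8 = stripcslashes($escaped);

// Step 2: json_encode() escapes non-ASCII characters as \uXXXX by default.
$json = json_encode($utf8);

echo bin2hex($utf8), "\n"; // f09f9883
echo $json, "\n";          // "\ud83d\ude03" (including the JSON quotes)
```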

mario
  • Thank you, but `json_encode(stripcslashes("\xF0\x9F\x98\x83"))` does not convert "\xF0\x9F\x98\x83" to "\u1F603"; the result is "\ud83d\ude03". On this page, [WordPress smilies_init()](https://developer.wordpress.org/reference/functions/smilies_init/), you can find the array `$wpsmiliestrans`. I need to convert its values to Unicode code points so I can create an image link like "twemoji.maxcdn.com/36x36/2764.png", where `2764` is the code point in hex. Sorry for my bad English. – Cople May 10 '15 at 07:37
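
For the image link described in the comment above, a rough sketch might look like the following. It assumes PHP 7.2+ with the mbstring extension for `mb_ord()`. The helper name `twemoji_url` is hypothetical, the base URL and size are copied from the comment's example, and the hyphen-joined file name for multi-code-point emoji is an assumption about Twemoji's naming that should be checked against the actual asset list.

```php
<?php
/**
 * Hypothetical helper: build a Twemoji image URL for a UTF-8 emoji.
 * The default base URL and 36x36 size come from the example in the comment.
 */
function twemoji_url(string $emoji, string $base = 'http://twemoji.maxcdn.com/36x36/'): string
{
    // Split into characters; the /u flag makes PCRE treat the input as UTF-8.
    $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $chars === []) {
        throw new InvalidArgumentException('Expected a non-empty, valid UTF-8 string');
    }

    $hex = [];
    foreach ($chars as $char) {
        // mb_ord() returns the code point as an integer; dechex() gives lowercase hex.
        $hex[] = dechex(mb_ord($char, 'UTF-8'));
    }

    // Assumption: multi-code-point emoji use hyphen-joined file names.
    return $base . implode('-', $hex) . '.png';
}

echo twemoji_url("\xE2\x9D\xA4"), "\n";     // http://twemoji.maxcdn.com/36x36/2764.png
echo twemoji_url("\xF0\x9F\x98\x83"), "\n"; // http://twemoji.maxcdn.com/36x36/1f603.png
```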