Encode emoji to unicode code point - PHP/JS

Question

Let's say I have this string in PHP:

$str = '🀄️';

Or this string in JavaScript:

var str = '🀄️';

If I run utf8_encode($str) the result is \ud83c\udc04\ufe0f, but I want it to be 1F004, 1f004 or \u1f004 so that I can look for an image file that matches that character.

I have searched online many times for a way to encode it, and I have found that the same terms are often used for very different things. It looks like what I want is to "encode" a string to its UTF-32 code point, but I really don't know what this is called. I just want to convert 🀄️ into 1f004 using PHP and/or JavaScript.

http://www.fileformat.info/info/unicode/char/1f004/index.htm

Thanks.

score 6 · Answer 1 · answered Oct 09 '15 at 07:16

JavaScript function:

function e2u(str){
    str = str.replace(/\ufe0f|\u200d/gm, ''); // strips unicode variation selector and zero-width joiner
    var i = 0, c = 0, p = 0, r = [];
    while (i < str.length){
        c = str.charCodeAt(i++);
        if (p){
            r.push((65536+(p-55296<<10)+(c-56320)).toString(16));
            p = 0;
        } else if (55296 <= c && c <= 56319){
            p = c;
        } else {
            r.push(c.toString(16));
        }
    }
    return r.join('-');
}
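
For comparison, here is a minimal sketch of the same idea for ES2015+ environments, where strings can be iterated by code point. The name emojiToCodePoints is hypothetical and not part of the original answer. The last lines redo the surrogate-pair arithmetic by hand for the pair \ud83c\udc04 from the question.

```javascript
// Hypothetical helper name; assumes an ES2015+ runtime (for...of, codePointAt).
function emojiToCodePoints(str) {
    const hex = [];
    // for...of walks a string by code point, so a surrogate pair arrives as one item.
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        // Skip the variation selector (U+FE0F) and the zero-width joiner (U+200D),
        // mirroring what e2u strips with its regular expression.
        if (cp === 0xFE0F || cp === 0x200D) continue;
        hex.push(cp.toString(16));
    }
    return hex.join('-');
}

// The surrogate arithmetic from e2u, written out for \ud83c\udc04:
const high = 0xD83C; // high (lead) surrogate, range 0xD800-0xDBFF
const low = 0xDC04;  // low (trail) surrogate, range 0xDC00-0xDFFF
const manual = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

console.log(emojiToCodePoints('\u{1F004}\uFE0F')); // 1f004
console.log(manual.toString(16));                  // 1f004
```

The decimal constants in e2u are the same values: 55296 is 0xD800, 56319 is 0xDBFF, 56320 is 0xDC00 and 65536 is 0x10000.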

score 3 · Accepted Answer · edited May 23 '17 at 12:06

You want to get the Unicode code point from a stream of bytes, so utf8_encode won't help. I've found an implementation here.

// Returns the code point of the first UTF-8 encoded character in $c, or false.
// Uses $c[0] because the $c{0} offset syntax no longer works in PHP 8.
function utf8_to_unicode($c)
{
    $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}

var_dump( dechex(utf8_to_unicode('🀄')) ); // string(5) "1f004"

UTF-8 is compatible with single-byte ASCII, so the first branch is very easy: a first byte between 0 and 127 is itself the code point. Code points larger than 127 are represented by multi-byte sequences. The next 1,920 characters need two bytes, which the second branch decodes as ($ord0-192)*64 + ($ord1-128). To be well-formed, the first byte needs to be between 192 (11000000) and 223 (11011111), and the second byte must have the form 10xxxxxx (that is, from 128 to 191 in decimal). The first code point represented this way is U+0080, the last is U+07FF.

And so on.
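
Since the question also asks about JavaScript, the byte arithmetic described above can be sketched there as well. This is only an illustration under assumptions: utf8BytesToCodePoint is a hypothetical name, the input is an array of byte values, and, like the PHP function, it does not validate continuation bytes. The bit masks are equivalent to the subtractions in the PHP code for the byte ranges being checked.

```javascript
// Hypothetical JavaScript counterpart of utf8_to_unicode.
// Takes an array (or Uint8Array) of UTF-8 byte values and decodes the first character.
function utf8BytesToCodePoint(bytes) {
    const b0 = bytes[0];
    // 0xxxxxxx: single-byte ASCII, the byte is the code point.
    if (b0 <= 0x7F) return b0;
    // 110xxxxx 10xxxxxx: two bytes, U+0080 to U+07FF.
    if (b0 >= 0xC0 && b0 <= 0xDF) {
        return ((b0 & 0x1F) << 6) | (bytes[1] & 0x3F);
    }
    // 1110xxxx 10xxxxxx 10xxxxxx: three bytes.
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        return ((b0 & 0x0F) << 12) | ((bytes[1] & 0x3F) << 6) | (bytes[2] & 0x3F);
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes.
    if (b0 >= 0xF0 && b0 <= 0xF7) {
        return ((b0 & 0x07) << 18) | ((bytes[1] & 0x3F) << 12) |
               ((bytes[2] & 0x3F) << 6) | (bytes[3] & 0x3F);
    }
    return false; // not a valid lead byte (or empty input)
}

// U+1F004 is encoded in UTF-8 as the four bytes F0 9F 80 84.
console.log(utf8BytesToCodePoint([0xF0, 0x9F, 0x80, 0x84]).toString(16)); // 1f004
// U+00E9 is encoded as the two bytes C3 A9.
console.log(utf8BytesToCodePoint([0xC3, 0xA9]).toString(16));             // e9
```

In a browser or a recent Node.js, the byte array for a string can be obtained with new TextEncoder().encode(str).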

You can use this library in JS: https://github.com/mathiasbynens/jsesc and run it like this: `jsesc('🀄️', {'es6': true, 'escapeEverything': true});` - Rodrigo Polo, Sep 16 '15 at 03:53
