
Let's say I have this string on PHP:

$str = '🀄️';

Or this string on JavaScript:

var str = '🀄️';

If I call utf8_encode($str), the result is \ud83c\udc04\ufe0f, but I want it to be 1F004, 1f004, or \u1f004 so that I can look for an image file that matches that character.

I have done many online searches looking for a way to encode it, and I have found that the same terms are used for very different things in different places. It looks like what I want is to "encode" a string as its UTF-32 code point, but I really don't know the right name for it. I just want to convert this character into 1f004 using PHP and/or JavaScript.

http://www.fileformat.info/info/unicode/char/1f004/index.htm

Thanks.

2 Answers


JavaScript function:

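function e2u(str){
    // Strip the variation selector (U+FE0F) and the zero-width joiner (U+200D).
    str = str.replace(/\ufe0f|\u200d/g, '');
    var i = 0, c = 0, p = 0, r = [];
    while (i < str.length){
        c = str.charCodeAt(i++);
        if (p){
            // c is the low surrogate that completes the pending high surrogate p.
            r.push((0x10000 + ((p - 0xD800) << 10) + (c - 0xDC00)).toString(16));
            p = 0;
        } else if (0xD800 <= c && c <= 0xDBFF){
            // High surrogate: hold it until the next code unit arrives.
            p = c;
        } else {
            r.push(c.toString(16));
        }
    }
    // Hex code points joined by '-', e.g. '1f004'.
    return r.join('-');
}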
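
For comparison, on engines that support ES2015 the same result can probably be reached with `for...of` and `String.prototype.codePointAt`, which walk the string by code point and so take care of surrogate pairs for you. This is only a sketch: the name `toCodePoints` is made up for this illustration, and stripping U+FE0F and U+200D simply mirrors the function above. The surrogate arithmetic for this one character is also spelled out as a check.

```javascript
// Hypothetical helper (not from the original answer): hex code points joined by '-'.
function toCodePoints(str) {
    const out = [];
    // for...of iterates by code point, so a surrogate pair arrives as one item.
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        // Skip the variation selector (U+FE0F) and zero-width joiner (U+200D).
        if (cp === 0xfe0f || cp === 0x200d) continue;
        out.push(cp.toString(16));
    }
    return out.join('-');
}

// The surrogate pair arithmetic, spelled out for U+1F004:
const hi = 0xd83c, lo = 0xdc04;
const combined = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);

console.log(toCodePoints('\ud83c\udc04\ufe0f')); // 1f004
console.log(combined.toString(16));              // 1f004
```

The string is written with escapes here so the example does not depend on how the file itself is encoded.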
Rodrigo Polo

You want to get the Unicode code point from a stream of bytes, so utf8_encode won't help. I've found an implementation here.

// Returns the code point of the first UTF-8 character in $c, or false on an invalid lead byte.
// Uses $c[0] rather than $c{0}: the curly-brace offset syntax was removed in PHP 8.0.
function utf8_to_unicode($c)
{
    $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}

var_dump( dechex(utf8_to_unicode('🀄️')) ); // string(5) "1f004"

UTF-8 is compatible with the single-byte ASCII encoding, so the first case, $ord0 = ord($c[0]); if ($ord0>=0 && $ord0<=127) return $ord0;, is very easy. Code points larger than 127 are represented by multi-byte sequences. The next 1,920 characters need two bytes to be encoded: $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);. The first byte needs to be between 192 (11000000) and 223 (11011111) to be well-formed. The second byte must be 10xxxxxx (that is, from 128 to 191 in decimal). The first code point represented here is U+0080, the last is U+07FF.
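
To make the two-byte range concrete, here is a small JavaScript sketch of the same arithmetic. The name `decodeTwoBytes` is made up for this illustration. One detail worth noting: lead bytes 192 and 193 match the 110xxxxx pattern but could only produce values below 128 (overlong forms), so the smallest lead byte that appears in valid UTF-8 is 194.

```javascript
// Hypothetical helper: decode a two-byte UTF-8 sequence with the same
// arithmetic as the two-byte branch of the PHP function.
function decodeTwoBytes(b0, b1) {
    return (b0 - 192) * 64 + (b1 - 128);
}

// Smallest and largest well-formed two-byte sequences.
const first = decodeTwoBytes(194, 128); // bytes C2 80
const last = decodeTwoBytes(223, 191);  // bytes DF BF

console.log(first.toString(16), last.toString(16)); // 80 7ff
console.log(last - first + 1);                       // 1920
```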

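And so on: the three-byte and four-byte cases follow the same pattern. The first byte contributes its low 4 or 3 bits respectively, and each following 10xxxxxx byte contributes 6 more bits.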
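
Since the question also mentions JavaScript, the byte-level decoding can be sketched there too. This is a rough port under a few assumptions, not the linked implementation: the name `utf8BytesToCodePoint` is made up, the input is assumed to be an array (or Uint8Array) of byte values, and unlike the PHP version it rejects lead bytes 192 and 193 and checks that every following byte has the 10xxxxxx form.

```javascript
// Hypothetical port: decode the first code point from UTF-8 byte values.
// Returns false for an invalid lead byte or a missing/invalid continuation byte.
function utf8BytesToCodePoint(bytes) {
    const b0 = bytes[0];
    let extra, cp;
    if (b0 <= 0x7f) {
        return b0;                        // 0xxxxxxx: plain ASCII
    } else if (b0 >= 0xc2 && b0 <= 0xdf) {
        extra = 1; cp = b0 & 0x1f;        // 110xxxxx: keep low 5 bits
    } else if (b0 >= 0xe0 && b0 <= 0xef) {
        extra = 2; cp = b0 & 0x0f;        // 1110xxxx: keep low 4 bits
    } else if (b0 >= 0xf0 && b0 <= 0xf7) {
        extra = 3; cp = b0 & 0x07;        // 11110xxx: keep low 3 bits
    } else {
        return false;
    }
    for (let i = 1; i <= extra; i++) {
        const b = bytes[i];
        // Every continuation byte must look like 10xxxxxx.
        if (b === undefined || (b & 0xc0) !== 0x80) return false;
        cp = cp * 64 + (b & 0x3f);        // append its 6 payload bits
    }
    return cp;
}

// U+1F004 is encoded in UTF-8 as the bytes F0 9F 80 84.
console.log(utf8BytesToCodePoint([0xf0, 0x9f, 0x80, 0x84]).toString(16)); // 1f004
```

In a browser or a recent Node.js, the bytes could come from `new TextEncoder().encode(str)`; in JavaScript itself, though, reading the code point directly from the string is usually simpler than going through UTF-8.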

Federkun