Possible Duplicate:
How to convert text to unicode code point like \u0054\u0068\u0069\u0073 using php?

I'm trying to convert every character that doesn't fit into 7-bit ASCII into an escaped form, \uN, where N is the character's decimal code point. Here's what I've come up with:

private static function escape($str) {
    return preg_replace_callback('~[\\x{007F}-\\x{FFFF}]~u',function($m){return '\\u'.ord($m[0]);},$str);
}

I've tried it with characters like Gamma,

echo self::escape('Γ');

But I get \u206 back out instead of \u915. I can't figure out where I'm going wrong... ideas?

Actually, it appears that either the ord() function doesn't give me the value I want, or maybe the encoding of my .php file is wrong?
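
A quick way to see the symptom, as a sketch: `ord()` only inspects the first byte of its argument, so for a multi-byte UTF-8 character it reports the lead byte rather than the code point. The snippet assumes the string is UTF-8 and spells the bytes out explicitly so the file's own encoding doesn't matter; `$gamma` is just an illustrative name.

```php
<?php
// 'Γ' (U+0393, decimal 915) written as its two UTF-8 bytes.
$gamma = "\xCE\x93";

var_dump(strlen($gamma)); // int(2): two bytes, one character
var_dump(ord($gamma));    // int(206): 0xCE, the first byte only
```

That 206 is the same number that shows up in the \u206 output.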

mpen
  • Should have read the first comment on the `ord` page; http://ca3.php.net/manual/en/function.ord.php – mpen Oct 11 '12 at 20:37
  • That is *a* way to do it, but it is a horrible way to do it, just a minute while I dig out the *right* way to do it. – DaveRandom Oct 11 '12 at 20:39
  • @DaveRandom: There's another one at the very bottom of the page from 2004 that seems to work. – mpen Oct 11 '12 at 20:41
  • @mario: I think my solution looks like a cleaner, more efficient version of the one found on that page :) – mpen Oct 11 '12 at 20:45
  • Been using something similar: `return preg_replace("/[^\\x{0020}-\\x{007F}]/ue", "'\\u'.current(unpack('H*', iconv('UTF-8', 'UCS-2BE', '$0')))", $var);` – mario Oct 11 '12 at 20:50
  • @Mark I can't find my clip annoyingly. Basically you just need to do the opposite of what I did [here](http://stackoverflow.com/a/10645053/889949), if you can wait a few minutes I'll knock it up again. – DaveRandom Oct 11 '12 at 20:51
  • @mario: Nice one mario! I need decimal instead of hex though. Does unpack allow you to do that? I'll check... – mpen Oct 11 '12 at 20:58
  • Through trial and error, it's `n*` – mpen Oct 11 '12 at 21:00
  • Sure about that? That's an uncommon notation. But `hexdec()` wrapping might suffice. – mario Oct 11 '12 at 21:00
  • @mario: Yep.. I'm sure. RTF 1.5 spec. – mpen Oct 11 '12 at 21:01
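
Putting the comments above together, the iconv-based idea might look like the sketch below once `H*` is swapped for `n*` to get a decimal value. It assumes the iconv extension is available, that the input is valid UTF-8, and that only BMP characters (up to U+FFFF) need escaping, since UCS-2 cannot represent anything beyond that. The name `escape_via_iconv` is made up for illustration, and `preg_replace_callback()` stands in for the `/e` modifier, which PHP 7 and later no longer support.

```php
<?php
// Hypothetical helper name; a sketch of the iconv + unpack('n*') idea.
function escape_via_iconv(string $str): string
{
    return preg_replace_callback(
        '/[\x{0080}-\x{FFFF}]/u',          // non-ASCII characters in the BMP
        function (array $m): string {
            // UTF-8 -> big-endian UCS-2: exactly two bytes per BMP character.
            $ucs2 = iconv('UTF-8', 'UCS-2BE', $m[0]);
            // 'n*' reads unsigned 16-bit big-endian integers; take the first.
            return '\\u' . current(unpack('n*', $ucs2));
        },
        $str
    );
}

echo escape_via_iconv("\xCE\x93"), "\n"; // \u915
```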

1 Answer

I had to refresh my memory on exactly how UTF-8 works, but here is a utf8_ord() function and a complementary utf8_chr(). The utf8_chr() is lifted pretty much verbatim from my answer here.

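// Returns the code point of a single UTF-8 encoded character, or false when
// the bytes do not match the lead/continuation pattern of a 1-4 byte sequence.
function utf8_ord ($chr)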
{
    $bytes = array_values(unpack('C*', $chr));

    switch (count($bytes)) {
        case 1:
            return $bytes[0] < 0x80
                ? $bytes[0]
                : false;
        case 2:
            return ($bytes[0] & 0xE0) === 0xC0 && ($bytes[1] & 0xC0) === 0x80
                ? (($bytes[0] & 0x1F) << 6) | ($bytes[1] & 0x3F)
                : false;
        case 3:
            return ($bytes[0] & 0xF0) === 0xE0 && ($bytes[1] & 0xC0) === 0x80 && ($bytes[2] & 0xC0) === 0x80 
                ? (($bytes[0] & 0x0F) << 12) | (($bytes[1] & 0x3F) << 6) | ($bytes[2] & 0x3F)
                : false;
        case 4:
            return ($bytes[0] & 0xF8) === 0xF0 && ($bytes[1] & 0xC0) === 0x80 && ($bytes[2] & 0xC0) === 0x80 && ($bytes[3] & 0xC0) === 0x80
                ? (($bytes[0] & 0x07) << 18) | (($bytes[1] & 0x3F) << 12) | (($bytes[2] & 0x3F) << 6) | ($bytes[3] & 0x3F)
                : false;
    }

    return false;
}

// Encodes a code point below 0x110000 as a UTF-8 string; returns false for
// anything larger.
function utf8_chr ($ord)
{
    switch (true) {
        case $ord < 0x80:
            return pack('C*', $ord & 0x7F);
        case $ord < 0x0800:
            return pack('C*', (($ord & 0x07C0) >> 6) | 0xC0, ($ord & 0x3F) | 0x80);
        case $ord < 0x010000:
            return pack('C*', (($ord & 0xF000) >> 12) | 0xE0, (($ord & 0x0FC0) >> 6) | 0x80, ($ord & 0x3F) | 0x80);
        case $ord < 0x110000:
            return pack('C*', (($ord & 0x1C0000) >> 18) | 0xF0, (($ord & 0x03F000) >> 12) | 0x80, (($ord & 0x0FC0) >> 6) | 0x80, ($ord & 0x3F) | 0x80);
    }

    return false;
}
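
For completeness, here is a rough sketch of how a decoder like this could be plugged back into the `escape()` method from the question. To keep the snippet self-contained it uses a compact, non-validating stand-in called `codepoint_of()` (a hypothetical name, not part of the code above) and a plain function instead of a private static method. It assumes PHP 7+ and valid UTF-8 input, and it keeps the question's character range as-is.

```php
<?php
// Hypothetical, non-validating stand-in for a UTF-8 aware ord().
function codepoint_of(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Lead byte: keep the low payload bits (5, 4 or 3 for 2, 3 or 4 bytes).
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

function escape(string $str): string
{
    return preg_replace_callback(
        '~[\x{007F}-\x{FFFF}]~u',
        function (array $m): string {
            return '\\u' . codepoint_of($m[0]);
        },
        $str
    );
}

echo escape("\xCE\x93"), "\n"; // \u915
```

The only change from the question's version is that the callback no longer calls `ord()`, which looks at a single byte.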
DaveRandom
  • First time I've seen a `switch(true)`; neat. Regarding your `utf8_ord` -- why unpack into characters when you can unpack directly into a decimal using `n*`? – mpen Oct 12 '12 at 01:09
  • @Mark Because you need to examine each character. You need to extract a variable number of bits (5, 4 or 3) from the right of the first byte, and the 6 trailing bits from each subsequent byte. It's much simpler to deal with this 1 byte at a time. Unless I've missed something. Having said that, I may convert this to `N*` (you want a long for this as it may be 4 bytes) as it would be easier to validate that it really is a UTF-8 character. `switch (TRUE)` is like an elseif tree where you evaluate every expression as a boolean, in this specific case I think it's more readable, YMMV. – DaveRandom Oct 12 '12 at 08:06
  • @Mark In fact, [this](http://codepad.viper-7.com/iQZvue) is why you have to use chars and not shorts: anything that does not fit into a specific multibyte sequence is ignored by `unpack()`. – DaveRandom Oct 12 '12 at 09:02
  • @Mark Validation has now been added to `utf8_ord()` – DaveRandom Oct 12 '12 at 09:30
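
The point about chars versus shorts can be reproduced with a small sketch. This is not necessarily what the linked codepad shows; it just feeds a three-byte character ('€', U+20AC) to `unpack()` both ways, assuming UTF-8 input. Variable names are illustrative.

```php
<?php
// '€' (U+20AC, decimal 8364) as its three UTF-8 bytes.
$euro = "\xE2\x82\xAC";

// 'n*' reads whole 16-bit big-endian units: the odd trailing byte is dropped,
// and the result is the raw byte pair, not a code point.
$shorts = unpack('n*', $euro);
var_dump($shorts); // a single element holding 0xE282

// 'C*' yields every byte, so the payload bits can be masked out one by one.
$bytes = array_values(unpack('C*', $euro));
$codePoint = (($bytes[0] & 0x0F) << 12)   // 4 payload bits from the lead byte
           | (($bytes[1] & 0x3F) << 6)    // 6 bits from each continuation byte
           |  ($bytes[2] & 0x3F);
var_dump($codePoint); // int(8364)
```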