121

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

I found a similar question here, but it doesn't seem to work.

Docstero

8 Answers

207

Try this:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

In case the escapes are UTF-16 based (C/C++/Java/JSON-style):

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
Gumbo
  • 2
    @Docstero: The regular expression will match any sequence of `\u` followed by four hexadecimal digits. – Gumbo May 29 '10 at 10:42
  • Warning: preg_replace_callback() [function.preg-replace-callback]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1 – Docstero May 29 '10 at 10:48
  • 9
    This function cannot deal with supplementary characters as they cannot be represented in UCS-2. – Artefacto Nov 18 '11 at 10:45
  • I wrapped this into a one-parameter function to make it more convenient: `` – MrFusion Jan 09 '13 at 22:27
  • @MrFusion - Just fyi, since a lot of people may be interested in using this to correct json_decode output before the JSON_UNESCAPED_UNICODE option became available in 5.4. Your anonymous function will only work in 5.3+. So there's a pretty small window of versions where it would work and be useful for that specific problem. – DougW Feb 01 '13 at 19:16
  • You could of course use 'create_function', but that would be using eval, which I'm sure nobody here would ever do. – DougW Feb 01 '13 at 19:22
  • Gumbo, you are just great. I have been struggling with this problem for hours. – Muhammad Babar Aug 17 '14 at 08:03
  • This is nice, but on older PHP I get a T_FUNCTION error because of the anonymous function passed as the callback. Is there a way to fix it? – Marcin Majchrzak Apr 17 '15 at 01:36
  • 4
    @gumbo How do you call or use this function? – Demodave May 18 '15 at 14:46
  • This helps so much! A shame it doesn't capture supplementary characters with the new iOS 10 emoticons, but it's damn close! – ChristoKiwi Oct 13 '16 at 04:11
  • The json_decode function below works far better, clear concise, and fast. – Nico Westerdale Jul 03 '17 at 19:14
  • 3
    I found my way here as I had \u00ed in my output, but I was looking at the output with json_encode() and funnily enough the default json_encode() will trash up the output so use json_encode($theDict,JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE); – Tom Andersen Sep 26 '17 at 00:32
  • This code is outdated and doesn't work for Unicode characters encoded as a `surrogate pair`. Here's the code that worked for me instead: https://stackoverflow.com/a/27975110/843732 – c00000fd Apr 22 '20 at 20:47
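Several comments above point out that matching one `\uXXXX` at a time cannot handle supplementary characters, because a surrogate pair only makes sense as a unit. Here is a minimal sketch of one way around that, assuming the mbstring extension is available: match each run of consecutive escapes and convert the whole run from UTF-16BE in one call. The function name `decode_unicode_escapes` is made up for illustration and is not part of the answer.

```php
<?php
// Hypothetical helper: decodes \uXXXX escapes, keeping surrogate pairs together.
function decode_unicode_escapes(string $str): string
{
    // One or more consecutive \uXXXX escapes are matched as a single run,
    // so a lead surrogate and its trail surrogate are converted together.
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function (array $m): string {
        // Strip the "\u" prefixes, leaving the raw UTF-16BE code units in hex
        $hex = str_replace('\\u', '', $m[0]);
        return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
    }, $str);
}

var_dump(decode_unicode_escapes('\u00ed') === "\xC3\xAD");                      // í
var_dump(decode_unicode_escapes('cat\ud83d\ude38') === "cat\xF0\x9F\x98\xB8");  // U+1F638
```

How an unpaired surrogate is rendered is left to mbstring here, so input that may contain one should be validated separately.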
83
print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )
2BJ
  • 56
    It doesn't even need the object wrapper: `json_decode('"' . $text . '"')` – deceze May 15 '13 at 12:15
  • 5
    Thanks. **This seems to be the STANDARD WAY**, rather than the accepted answer. – T.Todua Nov 25 '16 at 08:36
  • 1
    Interestingly, this also works for complex entities like smiley faces... `json_decode('{"t":"\uD83D\uDE0A"}')` gives 😊 – DynamicDan Oct 23 '17 at 05:38
  • 3
    @deceze you should include the fact that `$text` can include double quotes. So a revised version would be: `json_decode('"'.str_replace('"', '\\"', $text).'"')`. Thanks for your help :-) – Yvan Oct 24 '18 at 04:12
  • 1
    The comment beats all the answers – Stavros Oct 12 '22 at 19:34
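Putting the two comments above together, a small wrapper might look like the sketch below. The name `json_unescape` is hypothetical. It assumes the input uses only JSON-compatible escapes, contains no raw control characters such as newlines, and that its double quotes are not already backslash-escaped; otherwise `json_decode` rejects the string and the sketch returns `null`.

```php
<?php
// Hypothetical helper: decode \uXXXX escapes by treating the text as a JSON string.
function json_unescape(string $text): ?string
{
    // Escape bare double quotes so the wrapped value is still a single JSON string
    $quoted = '"' . str_replace('"', '\\"', $text) . '"';
    $decoded = json_decode($quoted);

    // json_decode returns null when the input is not valid JSON
    return is_string($decoded) ? $decoded : null;
}

var_dump(json_unescape('\u00ed') === "\xC3\xAD");                          // í
var_dump(json_unescape('say "hi" \u00ed') === "say \"hi\" \xC3\xAD");      // quotes survive
var_dump(json_unescape('\uD83D\uDE0A') === "\xF0\x9F\x98\x8A");            // U+1F60A
```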
26

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{00ed}"; outputs í.
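Note that this syntax is resolved when PHP parses a double-quoted literal in source code, so it does not by itself decode escape sequences that arrive in a string at runtime. A short sketch of the difference (the variable names are only for illustration):

```php
<?php
// Written in source code: the escape is resolved when the file is parsed.
$literal = "\u{00ed}";     // í (U+00ED)
$emoji   = "\u{1F638}";    // U+1F638, beyond the BMP, no surrogate pair needed

// Read at runtime (single quotes stand in for external input here):
$input = '\u00ed';         // still six characters: \ u 0 0 e d

var_dump($literal === "\xC3\xAD");        // bool(true)
var_dump($emoji === "\xF0\x9F\x98\xB8");  // bool(true)
var_dump(strlen($input));                 // int(6)
```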

Rabin Lama Dong
17
$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

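        // Pass the charset explicitly so the result is UTF-8 regardless of PHP version
        return html_entity_decode('&#'.$cp.';', ENT_QUOTES, 'UTF-8');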
    }, $str);
}
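To make the surrogate arithmetic in the function above easier to follow, here is the same formula isolated and rewritten as offsets from the start of each surrogate range. The helper name `surrogate_pair_to_codepoint` is made up for this illustration.

```php
<?php
// Hypothetical helper: combine a UTF-16 lead/trail surrogate into a code point.
function surrogate_pair_to_codepoint(int $lead, int $trail): int
{
    // Each surrogate carries 10 bits; the result is offset by 0x10000
    return 0x10000 + (($lead - 0xD800) << 10) + ($trail - 0xDC00);
}

// \ud83d\ude38 from the example above
$cp = surrogate_pair_to_codepoint(0xD83D, 0xDE38);
printf("U+%X\n", $cp); // U+1F638
```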
masakielastic
2

This is a sledgehammer approach to replacing raw Unicode escape sequences with HTML numeric entities. I haven't seen any other place to put this solution, but I assume others have had this problem.

Apply this str_replace function to the RAW JSON, before doing anything else.

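function unicode2html($str){
    $i=65535;
    while($i>0){
        // Zero-pad to four hex digits so that e.g. 237 matches "\u00ed"
        $hex=sprintf('%04x',$i);
        // Case-insensitive, because escapes may use upper- or lowercase hex
        $str=str_ireplace('\u'.$hex,"&#$i;",$str);
        $i--;
    }
    return $str;
}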

This won't take as long as you think, and it will replace any four-digit \uXXXX escape (that is, anything in the Basic Multilingual Plane) with an HTML numeric entity.

Of course, the range can be reduced if you know which Unicode blocks are being returned in the JSON.

For example, my code was getting lots of arrow and dingbat characters. These lie between 8448 and 11263, so my production code looks like:

$i=11263;
while($i>8448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/ If you know you're translating Arabic or Telugu or whatever, you can just replace those codes, not all 65,535.

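You could apply this same sledgehammer to produce the characters directly instead of entities. Note that chr() only handles values below 256, so use mb_chr() (PHP 7.2+ with mbstring):

 $str=str_replace("\u$hex",mb_chr($i,'UTF-8'),$str);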
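If the repeated `str_replace` passes turn out to be a bottleneck, the same `\uXXXX`-to-entity conversion can probably be done in a single pass. This is a rough sketch, not the original author's code, and the function name is made up for illustration:

```php
<?php
// Hypothetical single-pass variant: turn every \uXXXX escape into a decimal HTML entity.
function unicode_escapes_to_entities(string $str): string
{
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function (array $m): string {
        // hexdec() accepts upper- and lowercase hex digits
        return '&#' . hexdec($m[1]) . ';';
    }, $str);
}

echo unicode_escapes_to_entities('\u00ed \u2192'), "\n"; // &#237; &#8594;
```

Like the loop above, this treats each four-digit escape on its own, so surrogate pairs are not combined into a single entity.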
Nemo Noman
1

There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // Note: RFC 3629 limits UTF-8 to U+10FFFF; the 5- and 6-byte forms below
    // are kept from the original but are not valid in modern UTF-8.
    // U-00000000 - U-0000007F:   0xxxxxxx
    if($unicode_c_val <= 0x7F){
        $str = chr($unicode_c_val);
    }
    // U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){
        $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    }
    // U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){
        $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3);
    }
    // U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){
        $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    // U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){
        $h=0xF8; // 11111000
        $c1 = $unicode_c_val >> 24 | $h;
        $c2 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
        $c3 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c4 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c5 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
    }
    // U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){
        $h=0xFC; // 11111100
        $c1 = $unicode_c_val >> 30 | $h;
        $c2 = (($unicode_c_val & 0x3F000000) >> 24) | $f;
        $c3 = (($unicode_c_val & 0xFC0000) >> 18) | $f;
        $c4 = (($unicode_c_val & 0x3F000) >> 12) | $f;
        $c5 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c6 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
    }
    return $str;
}
function entities2utf8($unicode_c){
    // The /e modifier was removed in PHP 7, so use a callback instead.
    // Numeric entities such as &#237; are decimal, which matches intval() above.
    return preg_replace_callback('/&#(\d+);/', function ($m) {
        return entity2utf8onechar($m[1]);
    }, $unicode_c);
}
jianyong
1

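This fixes JSON values in which the backslash before uXXXX has been lost: it puts the \ back in front of each u that starts an escape, then decodes the value.

$item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
    // Restore the backslash only where "u" is followed by four hex digits
    $matches[2] = preg_replace('/u(?=[0-9a-fA-F]{4})/', '\\\\u', $matches[2]);
    $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]);
    $matches[2] = json_decode('"' . $matches[2] . '"');
    return '"' . $matches[1] . '":"' . $matches[2] . '",';
}, $item);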
orel
0

There is a very simple and beautiful solution.

If we want to decode Unicode escape sequences like "\u00ed" to "í", we can simply use json_decode:

$a="\u00ed";
echo json_decode("\"$a\"");
# output: í

It works because \uXXXX is JSON's own escape syntax: by default, json_encode escapes every non-ASCII character as a \uXXXX sequence, and json_decode reverses that:

echo json_encode("í");
# output: "\u00ed"
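As a quick check of that round trip, here is a short sketch that also shows the `JSON_UNESCAPED_UNICODE` flag (PHP 5.4+) mentioned in the comments on an earlier answer. The variable names are only for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
$encoded = json_encode("í");                          // non-ASCII is escaped by default
$raw     = json_encode("í", JSON_UNESCAPED_UNICODE);  // keep the character as-is
$decoded = json_decode('"\u00ed"');                   // the reverse direction

var_dump($encoded === '"\u00ed"');   // bool(true)
var_dump($raw === '"í"');            // bool(true)
var_dump($decoded === "í");          // bool(true)
```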

This is a small extension of the solution at https://stackoverflow.com/a/7981441/5599052.

Sergey Yurich