I've been reading up on a few solutions but have not managed to get anything to work as yet.

I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.

I'd like to use PHP to convert these into either £ or &pound;.

I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:

$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');

The output is Â£.

Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?

UPDATE

It seems that the JSON string from the API has 2 or 3 unescaped Unicode strings, e.g.:

That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a (pound symbol)
Pang
Alexander Holsgrove
  • It sounds like the encoding is broken at the other end of the API. `Â£` is what you typically get if you take UTF-8 encoded data and read it as ISO-8859-1. I guess that is happening somewhere in the API provider's system before the resulting string is then JSON encoded. A bit of a mess, really. The first port of call should be to notify the API provider and ask them to fix it. – SDC Jan 25 '13 at 17:29
  • Thanks SDC. I dropped them an email to say just that. Hopefully it will be updated soon, but perhaps that is wishful thinking! – Alexander Holsgrove Jan 25 '13 at 22:45

3 Answers

It is not UTF-16 encoding. Rather, it looks like bogus encoding, because the \uXXXX escape names a code point and is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really does map to the string Â£.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the 2-byte UTF-8 encoding of this code point.
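To make the difference concrete, here is a minimal sketch of how the two escape forms can arise. The "faulty producer" loop is a hypothetical reconstruction of what the API might be doing, not something confirmed by the provider; the variable names are illustrative only.

```php
<?php
// U+00A3 (pound sign), written with PHP 7's code point escape so the
// example does not depend on the encoding of this source file.
$pound = "\u{00A3}";

// Its UTF-8 form is the two bytes 0xC2 0xA3.
$utf8Hex = bin2hex($pound);

// Correct JSON escaping produces a single escape for the code point.
$good = json_encode($pound);

// Hypothetical faulty producer: it escapes every UTF-8 byte as if it
// were a code point of its own.
$bad = '';
foreach (str_split($pound) as $byte) {
    $bad .= sprintf('\u%04x', ord($byte));
}

echo $utf8Hex, "\n"; // c2a3
echo $good, "\n";    // "\u00a3"
echo $bad, "\n";     // \u00c2\u00a3
```

The last line is the sequence seen in the feed, which is why decoding it as JSON gives two characters instead of one.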

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of escaped code points back into a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

function fixBadUnicode($str) {
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}

Example here: http://phpfiddle.org/main/code/6sq-rkn

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}

Edit 2: fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, are not representable in ISO-8859-1 and will therefore appear as '?' after utf8_decode.
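As a small illustration of that caveat, the sketch below converts two characters from UTF-8 to ISO-8859-1. It assumes the mbstring extension is loaded and uses mb_convert_encoding in place of utf8_decode, which is deprecated as of PHP 8.2; the variable names are made up for the example.

```php
<?php
// Assumption: the mbstring extension is available.
$pound = "\u{00A3}"; // pound sign, exists in ISO-8859-1 as byte 0xA3
$quote = "\u{2019}"; // right single quotation mark, not in ISO-8859-1

$poundLatin1 = mb_convert_encoding($pound, 'ISO-8859-1', 'UTF-8');
$quoteLatin1 = mb_convert_encoding($quote, 'ISO-8859-1', 'UTF-8');

echo bin2hex($poundLatin1), "\n"; // a3
echo $quoteLatin1, "\n";          // ? (mbstring's default substitute character)
```

If the page is served as UTF-8, skipping the conversion to ISO-8859-1 avoids this loss altogether.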

SirDarius
  • Thanks for this. Can I run that on the entire string before|after calling json_decode to save calling 'fixBadUnicode' multiple times. – Alexander Holsgrove Jan 25 '13 at 15:54
  • you can run it before json_decode, however be careful that this might lead your json string to contain illegal characters, see json.org for the list of characters that can exist in json strings. – SirDarius Jan 25 '13 at 15:58
  • If I run it on the raw JSON, it converts the '\u00c2\u00a3' to '�'. I also found \u0099 is left unchanged - I think this is an apostrophe. Seems like a really poor JSON data feed! – Alexander Holsgrove Jan 25 '13 at 16:03
  • That's great - thank you. I don't need the encoded JSON after it has been 'fixed' as I need to iterate through the data. Can I instead call json_decode and then preg_replace(...) without needing to call json_encode and the substr? – Alexander Holsgrove Jan 25 '13 at 16:14
  • @AlexHolsgrove I'm afraid no. `fixBadUnicodeForJson` will have to be called first on the raw json data, then use json_decode on the result, and you're good. – SirDarius Jan 25 '13 at 16:16
  • It seems to find more invalid UTF-8 data. I set up a demo here (where you can also see the raw JSON): http://phpfiddle.org/main/code/rfk-50n – Alexander Holsgrove Jan 25 '13 at 17:08
  • Do I need to run the 'fix' twice? I can't see how to get it to decode the json as it won't return the array. – Alexander Holsgrove Jan 28 '13 at 10:05
  • You need to take into account UTF-8 characters with more than two bytes... see my edit :) – SirDarius Jan 28 '13 at 13:17
  • preg_replace "e" is deprecated, can you write this in the format of "preg_replace_callback" ? – Hossein J Oct 31 '15 at 12:41

The output is correct.

\u00c2 == Â
\u00a3 == £

So nothing is wrong here. And converting to HTML entities is easy:

htmlentities($title);
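A hedged sketch of the HTML step, assuming $title already holds a correctly decoded UTF-8 string (that is, the £ is a single character rather than the Â£ pair). Before PHP 5.4 the default charset of htmlentities was ISO-8859-1, so the encoding is passed explicitly here.

```php
<?php
// Assumption: $title is valid UTF-8 containing the real pound sign U+00A3.
$title = "\u{00A3}5 & up";

// Named entities for everything that has one.
$asEntities = htmlentities($title, ENT_QUOTES, 'UTF-8');

// If the page itself is served as UTF-8, escaping only the HTML
// metacharacters is enough and the pound sign can stay as it is.
$asSpecial = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

echo $asEntities, "\n"; // &pound;5 &amp; up
echo $asSpecial, "\n";
```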
Yo-han
  • The first part is correct, but htmlentities($title) gives me Ã�£ – Alexander Holsgrove Jan 25 '13 at 14:48
  • The output is correct, but it is obvious that the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point. – SirDarius Jan 25 '13 at 14:48
  • Just for reference, the JSON is from the Hot UK Deals API. I didn't want to mess about with the default XML feed type – Alexander Holsgrove Jan 25 '13 at 15:58

Here is an updated version of the function using preg_replace_callback instead of preg_replace.

function fixBadUnicodeForJson($str) {
    // Inside a callback the captured groups live in $matches[1], $matches[2], ...
    // (the "$1" placeholders only work in preg_replace replacement strings).
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
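For reference, here is a narrower variant together with a usage example. It is only a sketch under stated assumptions: fixEscapedUtf8Bytes is a hypothetical name, and the pattern is restricted to runs of escapes that look like a UTF-8 lead byte followed by the right number of continuation bytes, so that escapes such as \u0022 (a quote) are left alone and the result stays parseable as JSON. It does not check for overlong forms, and it would also rewrite a legitimate pair of escaped characters that happens to look like a UTF-8 sequence (for example a real Â followed by £).

```php
<?php
// Hypothetical helper: turns byte-wise escapes such as \u00c2\u00a3 into
// the raw UTF-8 bytes they were meant to represent.
function fixEscapedUtf8Bytes(string $json): string
{
    // One continuation byte (0x80-0xBF) written as a \u00XX escape.
    $cont = '\\\\u00[89ab][0-9a-f]';
    // A lead byte followed by the number of continuation bytes it announces.
    $pattern = '/\\\\u00(?:(?:c[2-9a-f]|d[0-9a-f])' . $cont
             . '|e[0-9a-f](?:' . $cont . '){2}'
             . '|f[0-4](?:' . $cont . '){3})/i';

    $fixed = preg_replace_callback($pattern, function (array $m): string {
        // Replace every \u00XX in the matched run with the byte 0xXX.
        return preg_replace_callback(
            '/\\\\u00([0-9a-f]{2})/i',
            function (array $b): string { return chr(hexdec($b[1])); },
            $m[0]
        );
    }, $json);

    // preg_replace_callback returns null on failure; fall back to the input.
    return $fixed ?? $json;
}

// Usage: fix the raw JSON first, then decode it.
$raw  = '{"title":"That\u00e2\u0080\u0099s \u00c2\u00a35"}';
$data = json_decode(fixEscapedUtf8Bytes($raw), true);

echo $data['title'], "\n";
```

As in the comments above, the fix is applied to the raw JSON text before json_decode is called.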
Yann Rimbaud