18

I have a problem with json_encode function with special characters.

For example I try this:

$string="Svrček";

echo "ENCODING=".mb_detect_encoding($string); //ENCODING=UTF-8

echo "JSON=".json_encode($string); //JSON="Svr\u010dek"

What can I do to display the string correctly, so JSON="Svrček"?

Thank you very much.

Programmer Bruce
epi82

3 Answers

45

json_encode() is not actually outputting JSON* there. It’s outputting a JavaScript string. (It outputs JSON when you give it an object or an array to encode.) That’s fine, as a JavaScript string is what you want.

In JavaScript (and in JSON), č may be escaped as \u010d. The two forms are equivalent, so there’s nothing wrong with what json_encode() is doing, and I’d be very surprised if it is actually causing you a problem. However, if the transfer is safely in a Unicode encoding (UTF-8, usually)†, the escaping isn’t needed either. To turn it off, pass the JSON_UNESCAPED_UNICODE flag: json_encode('Svrček', JSON_UNESCAPED_UNICODE). Note that this flag was introduced in PHP 5.4.0 and is unavailable in earlier versions.
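As a minimal sketch of the difference, assuming PHP 5.4.0 or later and a source file saved as UTF-8, the two calls can be compared side by side. The variable names are only for illustration.

```php
<?php
$name = "Svrček"; // assumes this file is saved as UTF-8

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($name);
echo $escaped, "\n"; // "Svr\u010dek"

// With the flag (PHP 5.4.0+), the character is written out as-is.
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE);
echo $unescaped, "\n"; // "Svrček"

// Both forms decode to the same string, so a consumer sees no difference.
var_dump(json_decode($escaped) === json_decode($unescaped)); // bool(true)
```

The escaped form is pure ASCII, so it also survives a transfer whose encoding is uncertain; the unescaped form is only slightly shorter.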

By the way, contrary to what @onteria_ says, JSON does use UTF-8:

The character encoding of JSON text is always Unicode. UTF-8 is the only encoding that makes sense on the wire, but UTF-16 and UTF-32 are also permitted.


* Or, at least, it's not outputting JSON as defined in RFC 4627, which requires an object or an array at the top level. Other definitions of JSON, such as the later RFC 7159, allow scalar values.

† JSON may be in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE.

Alexander Ushakov
TRiG
  • +1 for JSON_UNESCAPED_UNICODE – bizzr3 Jul 16 '14 at 17:30
  • What's the alternative for PHP 5.2? – Muhammad Babar Aug 16 '14 at 21:27
  • Do you actually need `JSON_UNESCAPED_UNICODE`, @MuhammadBabar? If you're not using UTF-8, you don't *want* it. If you are using UTF-8, you still don't *need* it: using it will make your output *slightly* smaller, that's all. – TRiG Aug 17 '14 at 21:39
  • Yes, I need it, and I'm using UTF-8. The question was about escaped Unicode and getting the actual characters back. I did find a solution, though. Thanks and cheers – Muhammad Babar Aug 18 '14 at 06:06
  • @MuhammadBabar What was the solution? – Ted Jan 26 '15 at 09:45
  • @Ted use this function: `function replace_unicode_escape_sequence($match) { return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'); }` like this: `preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $json);` You may also need to save your script file as `UTF-8` encoded! – Muhammad Babar Jan 27 '15 at 14:22
  • Your best option, @Ted, would be to upgrade PHP to a version which supports `JSON_UNESCAPED_UNICODE`. However, as I said before, the escaped version is perfectly valid and really shouldn't cause any problems. I'm curious why anyone would need to do this. – TRiG Jan 27 '15 at 15:08
  • Please note that `json_encode` does not emit escaped Unicode code points. For example `json_encode('Hello José 😱')` produces the horrible `"Hello Jos\u00e9 \ud83d\ude31"`. U+D83D and U+DE31 are not legal Unicode code points. So it is emitting the horrible, horrible UTF-16 code units. It has erred in confusing logical code points with physical encoding layouts, an abstraction violation seen again and again in places like Java and C# and Windows. – tchrist Apr 26 '15 at 21:04
  • @tchrist. [Related question](http://stackoverflow.com/q/38463038). – TRiG Jul 19 '16 at 15:45
  • You're one of those types that are completely right, but where you just hate the answer anyway. var_dump always showed it correctly, had no clue I had to configure json_encode to anything. Thanks – Christopher Bonitz Aug 28 '20 at 10:00
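
The surrogate-pair behaviour tchrist describes in the comments can be reproduced with a character outside the Basic Multilingual Plane. This is a small sketch, assuming PHP 5.4.0 or later for JSON_UNESCAPED_UNICODE. The character is written as raw UTF-8 bytes so the example does not depend on the file's encoding, and the variable names are just illustrative.

```php
<?php
// U+1F631 written as its four UTF-8 bytes.
$emoji = "\xF0\x9F\x98\xB1";

// Default: one code point becomes two \u escapes (a UTF-16 surrogate pair).
$escaped = json_encode($emoji);
echo $escaped, "\n"; // "\ud83d\ude31"

// A JSON parser joins the pair back into the original character.
$roundTrip = json_decode($escaped);
var_dump($roundTrip === $emoji); // bool(true)

// With the flag, the character is written out as UTF-8 instead.
$raw = json_encode($emoji, JSON_UNESCAPED_UNICODE);
var_dump($raw === '"' . $emoji . '"'); // bool(true)
```

The pair is the form the JSON grammar prescribes for escaping characters above U+FFFF, so a conforming decoder accepts it and restores the single character.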
11

After you make the database connection in your PHP script, add this line. It solved the problem for me, at least:

mysql_query('SET CHARACTER SET utf8');
Vulovic Vukasin
7

Yes, json_encode escapes non-ASCII characters. If you decode the result, you'll get your original string back:

$string="こんにちは";
echo "ENCODING: " . mb_detect_encoding($string) . "\n";
$encoded = json_encode($string);
echo "ENCODED JSON: $encoded\n";
$decoded = json_decode($encoded);
echo "DECODED JSON: $decoded\n";

Output:

ENCODING: UTF-8
ENCODED JSON: "\u3053\u3093\u306b\u3061\u306f"
DECODED JSON: こんにちは

EDIT: It's worth noting that:

JSON uses Unicode exclusively.

The self-documenting format that describes structure and field names as well as specific values;

Source: http://www.json.org/fatfree.html

It uses Unicode, NOT UTF-8. This FAQ explains the difference between UTF-8 and Unicode:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

When you use json_encode, your non-ASCII characters get escaped as Unicode code points. For example, こ is code point U+3053.
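
To make the distinction between a code point and its UTF-8 encoding concrete, here is a small sketch for こ. It assumes the mbstring extension (already used above via mb_detect_encoding) and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
$char = "こ"; // assumes this file is saved as UTF-8

// UTF-8 stores the character as three bytes.
$utf8Hex = bin2hex($char);
echo $utf8Hex, "\n"; // e38193

// Converted to UTF-16BE, the bytes spell out the code point U+3053.
$codePointHex = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8'));
echo $codePointHex, "\n"; // 3053

// json_encode's default escape uses that same code point.
echo json_encode($char), "\n"; // "\u3053"
```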

onteria_