17

I have a simple test page in UTF-8 where text with letters in multiple different languages gets stringified to JSON:

http://jsfiddle.net/Mhgy5/

HTML:

<textarea id="txt">
検索 • Busca • Sök • 搜尋 • Tìm kiếm • Пошук • Cerca • Søk • Haku • Hledání • Keresés • 찾기 • Cari • Ara • جستجو • Căutare • بحث • Hľadať • Søg • Serĉu • Претрага • Paieška • Poišči • Cari • חיפוש • Търсене • Іздеу • Bilatu • Suk • Bilnga • Traži • खोजें
</textarea>
<button id="encode">Encode</button>
<pre id="out">
</pre>

JavaScript:

$("#encode").click(function () {
    $("#out").text(JSON.stringify({ txt: $("#txt").val() }));
}).click();

While I expect the non-ASCII characters to be escaped as \uXXXX as per the JSON spec, they seem to be untouched. Here's the output I get from the above test:

{"txt":"検索 • Busca • Sök • 搜尋 • Tìm kiếm • Пошук • Cerca • Søk • Haku • Hledání • Keresés • 찾기 • Cari • Ara • جستجو • Căutare • بحث • Hľadať • Søg • Serĉu • Претрага • Paieška • Poišči • Cari • חיפוש • Търсене • Іздеу • Bilatu • Suk • Bilnga • Traži • खोजें\n"}

I'm using Chrome, so it should be the native JSON.stringify implementation. The page's encoding is UTF-8. Shouldn't the non-ASCII characters be escaped?

What brought me to this test in the first place is, I noticed that jQuery.ajax doesn't seem to escape non-ASCII characters when they appear in a data object property. The characters seem to be transmitted as UTF-8.

athspk
Ates Goral
  • 1
    I don't think your assertion that every non-ASCII character *must be transformed into an escape sequence* is accurate, or even anywhere close to the truth. – Kerrek SB Sep 04 '12 at 21:27
  • 2
    possible duplicate of [JSON and escaping characters](http://stackoverflow.com/questions/4901133/json-and-escaping-characters) – James Montagne Sep 04 '12 at 21:28

5 Answers

38

The JSON spec does not require Unicode characters to be converted to escape sequences. Its string grammar allows "any UNICODE character except " or \ or control character" to appear literally in a JSON-serialized string:

[Image: JSON string format diagram]
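One way to check this against a native implementation is to stringify a string that mixes ordinary non-ASCII text with the characters the grammar does exclude. This is only a minimal sketch; the sample string and variable names are made up for illustration.

```javascript
// Sample input (made up for illustration): non-ASCII letters, double quotes,
// a backslash, and a control character (a newline).
const sample = 'Sök "搜尋" \\ \n';
const encoded = JSON.stringify(sample);

// Only the quotes, the backslash and the control character are escaped;
// the non-ASCII letters are emitted literally.
console.log(encoded);

// Parsing the result gives back the original string.
console.log(JSON.parse(encoded) === sample);
```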

Rob W
  • 2
    Just because it's not _demanded_ by the spec does not mean that it's unworthy of implementation. In fact, the `\uXXXX` format is right there at the bottom and is frequently required for interoperability with external services and/or transports that are not "safe" beyond 7bit representations. The fact that JS's native JSON encoder is fundamentally incapable of producing output that complies with its own spec is laughable, and the various workarounds for this are themselves frequently contributors to the exact same problems further into their respective stacks. – Sammitch Apr 07 '21 at 20:12
10

Indeed, JSON.stringify does not escape non-ASCII characters:

JSON.stringify({a:"Привет!"})
{"a":"Привет!"}

But I had an issue when storing that JSON via Perl DBD::mysql and then retrieving it back. I found it safer to follow the recommendation to escape all non-ASCII and non-visible characters as \uXXXX. Here is how:

// Escape everything outside printable ASCII as \uXXXX (one escape per UTF-16 code unit).
function jsonEscapeUTF(s) {
    return s.replace(/[^\x20-\x7E]/g, x => "\\u" + x.charCodeAt(0).toString(16).padStart(4, "0"));
}

jsonEscapeUTF(JSON.stringify({a:"Привет!"}))
"{"a":"\u041f\u0440\u0438\u0432\u0435\u0442!"}"

Hopefully it will be helpful.

okharch
  • Note that this only works incidentally, because regexes don't do Unicode-aware matching by default (e.g. "💾".codePointAt(0) is 128190, so with the `u` flag, as in `/[^\x20-\x7F]/gu`, the regex no longer works with emojis) – wizzard0 Feb 01 '23 at 11:28
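To make the comment above concrete, here is a small sketch of what happens with a character outside the Basic Multilingual Plane. Without the `u` flag the regex sees the two UTF-16 code units separately, so each one gets its own \uXXXX escape, and the pair decodes back to the original character. The name escapeNonAscii is made up for this example; it is a variant of the approach in this answer, not its exact code.

```javascript
// Hypothetical helper, same idea as jsonEscapeUTF: no "u" flag, so the regex
// matches individual UTF-16 code units rather than whole code points.
function escapeNonAscii(json) {
    return json.replace(/[^\x20-\x7E]/g, unit =>
        "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0"));
}

const floppy = "\u{1F4BE}"; // U+1F4BE, code point 128190
const escaped = escapeNonAscii(JSON.stringify({ a: floppy }));

console.log(escaped);                          // {"a":"\ud83d\udcbe"}
console.log(JSON.parse(escaped).a === floppy); // true
```

With the `u` flag the callback would receive the whole two-unit character, and a single four-digit escape cannot hold a code point above U+FFFF.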
5

The short answer to your question is no: JSON.stringify shouldn't escape your string.

That said, UTF-8 strings can appear to be handled strangely if you save your HTML file in UTF-8 but don't declare that encoding in the page.

For example:

<!doctype html>
<html>
    <head>
        <title></title>
        <script>
            var data="árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP";
            alert(JSON.stringify(data));
        </script>
    </head>
</html>

This would alert "árvíztűrÅ‘ tükörfúrógép ÃRVÃZTŰRÅ TÜKÖRFÚRÓGÉP".

But if you add the following line to the header:

<meta charset="UTF-8">

Then, the alert will be what one could expect: "árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP".
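The garbling in this example comes from the decoding step, not from JSON.stringify. Here is a rough sketch of the effect, assuming the undeclared page is read one byte per character. The helper name misreadUtf8AsLatin1 is made up, and real browsers typically fall back to windows-1252, which differs from plain Latin-1 for bytes 0x80 to 0x9F.

```javascript
// Hypothetical helper: take the UTF-8 bytes of a string and read each byte
// back as a single character, which is what a wrong single-byte charset does.
function misreadUtf8AsLatin1(text) {
    const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
    return Array.from(bytes, b => String.fromCharCode(b)).join("");
}

const garbled = misreadUtf8AsLatin1("á");
console.log(garbled);                 // Ã¡

// JSON.stringify passes whatever string it was given through unchanged.
console.log(JSON.stringify(garbled)); // "Ã¡"
```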

nyedidikeke
Csongor Halmai
3

No. The preferred encoding for JSON is UTF-8, so those characters do not need to be escaped.

You are allowed to escape Unicode characters if you want to be safer or need to send the JSON in a different encoding (for example, pure ASCII), but it is against recommendations.

GolezTrol
1

Your claim is just not true. JSON strings consist of Unicode code points; only '"', '\' and control characters cannot appear literally. The entire JSON document can be encoded in UTF-8, UTF-16 or UTF-32, at the discretion of the producer. Additionally, strings can contain escape sequences, which are an alternative way of naming code points instead of including them literally.

If the distinction between the two still eludes you, here's an example of two different ways of writing the same string in JSON:

  • "A"

  • "\u0041"

Both versions represent the same string, which consists of the single code point U+0041, which is A.
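This equivalence can be checked directly with the native parser. A minimal sketch (the variable names are made up for illustration):

```javascript
// The same JSON string written literally and with an escape sequence.
const literal = JSON.parse('"A"');
const escapedForm = JSON.parse('"\\u0041"'); // the JSON text here is "\u0041"

console.log(literal === escapedForm);     // true
// Serializing again produces the literal form, not the escape.
console.log(JSON.stringify(escapedForm)); // "A"
```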

Kerrek SB