108

I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the resulting utf-8 valid ascii.

So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encode that and return it, or should it escape all those non-ascii characters and return pure ascii?

I'd like browsers to be able to execute the results using jsonp or eval. Does that affect the decision? My knowledge of various browsers' javascript support for utf-8 is lacking.

EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.

thomasrutter
schickb

6 Answers

107

The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.

The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.

Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
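As a rough illustration of that second reason, here is a minimal Python sketch. The helper name `dumps_html_safe` is made up for this example and is not part of any library; it simply post-processes the output of the standard `json.dumps`, assuming the only characters you care about are `<`, `>` and `&`.

```python
import json

def dumps_html_safe(value):
    """Hypothetical helper: serialize to JSON, then escape HTML-sensitive characters."""
    text = json.dumps(value, ensure_ascii=False)
    # In JSON output these characters can only occur inside string literals,
    # so swapping them for \uXXXX escapes leaves the decoded data unchanged.
    return (text.replace("<", "\\u003c")
                .replace(">", "\\u003e")
                .replace("&", "\\u0026"))

payload = {"comment": "</script><b>Tom & Jerry</b>"}
safe = dumps_html_safe(payload)
```

Any conforming decoder turns `safe` back into the original data, but the raw text no longer contains anything an HTML parser would react to.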

Some frameworks, including PHP's json_encode() (by default), always emit numeric escape sequences for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.

So you could decide which to use like this:

  • Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.

  • If it isn't binary-safe, use the numeric escape sequences.
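To make the two options concrete, here is a small Python sketch using the standard `json` module (the sample data is invented for illustration). `ensure_ascii` is the switch that selects between literal UTF-8 output and `\uXXXX` escapes; other libraries have their own equivalent.

```python
import json

resource = {"name": "café", "symbol": "€"}

# Option 1: keep non-ASCII characters literal and send the text as UTF-8 bytes.
utf8_body = json.dumps(resource, ensure_ascii=False).encode("utf-8")

# Option 2: write every non-ASCII character as a \uXXXX escape (pure ASCII output).
ascii_body = json.dumps(resource, ensure_ascii=True).encode("ascii")

# A conforming decoder recovers the same data from either form.
decoded_utf8 = json.loads(utf8_body.decode("utf-8"))
decoded_ascii = json.loads(ascii_body.decode("ascii"))
```

The escaped form is longer on the wire but survives transports that mangle bytes above 0x7F; the decoded result is identical.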

thomasrutter
  • "all JSON decoders can handle UTF-8" While this is true of browsers, just because the standard requires it doesn't mean all software decoding JSON supports UTF-8. – Michael Mior Jun 03 '19 at 15:05
  • "All JSON decoders can handle UTF-8" is literally true. If something can't accept UTF-8, it's not a JSON decoder. It may be similar to a JSON decoder, but it definitely isn't one. – thomasrutter Jun 04 '19 at 13:03
  • I guess that depends on what definition of JSON decoder you're using, but fair point :) – Michael Mior Jun 04 '19 at 15:58
  • The reason RFC 8259 specifies UTF-8 support as mandatory is that it's what the world standardized on. Previous obsolete specs defined strings as Unicode but didn't specify which encoding; implementations standardised on UTF-8 anyway and the updated spec reflects that. – thomasrutter Jun 04 '19 at 22:03
  • UTF-8 support isn't specified as mandatory in that RFC for any particular software as far as I can tell. The only mention of UTF-8 is that it must be used as the encoding for JSON exchanged outside of a closed system. This does not imply that all JSON decoders (a language not used in the RFC) must support UTF-8. – Michael Mior Jun 05 '19 at 16:47
  • Yes if you want to use a version of JSON internally without needing to exchange it with any other system you are free to use any character encoding, as long as you accept that a JSON implementation incapable of understanding JSON from any other system may be of limited use. – thomasrutter Jun 06 '19 at 02:41
  • The official proposed schema for JSON specifies a JSON string as “A string of Unicode code points”. This means a string of 32-bit values. In fact, UTF-8 isn't even mentioned in http://json-schema.org/draft/2019-09/json-schema-core.html . – David Spector Sep 17 '20 at 17:04
  • @DavidSpector wrong document - you're looking at the proposal for the media type application/schema+json, that's not where JSON is defined. When referring to encoding it says encoding for the schema is identical to in JSON, and references the JSON spec at: https://tools.ietf.org/html/rfc8259 where it is defined that JSON MUST use UTF-8 any time it's used outside of a closed ecosystem. – thomasrutter Sep 18 '20 at 04:24
  • Thank you for the correction! I panicked when I saw "a string of Unicode code points" because this is going backwards to fixed-length characters. – David Spector Sep 18 '20 at 17:05
17

I had a problem there. When I JSON-encode a string containing a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".

Then PHP's json_decode() fails if it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().

Note: in my tests, IE and Firefox use their native JSON object; the other browsers use json2.js.

Tim Tisdall
  • Probably you meant `utf8_encode()`, http://php.net/manual/en/function.utf8-encode.php – Binyamin Nov 28 '10 at 11:27
  • If IE is failing to decode that, it's a bug in whatever JSON decoder you're using. All JSON decoders must successfully decode the encoded form, or they're not a JSON decoder. As for your issue with json_decode() with the é unescaped, it's possible that the text you're feeding it isn't UTF-8. JSON decoders always assume UTF-8, even the PHP implementation, even though PHP doesn't normally assume UTF-8 in many other functions. There are other character encodings which can include an é unescaped and look identical on screen, but which aren't UTF-8. Encoding in \uXXXX form is a workaround to this. – thomasrutter Jan 23 '13 at 02:56
  • Just saying: JSON can legally come in any Unicode encoding (UTF-8, UTF-16 BE/LE, UTF32 BE/LE, with or without byte order marker). And since ASCII is a subset of UTF-8, it can also come in ASCII. Whether parsers accept UTF-32 for example, I don't know. – gnasher729 Sep 01 '16 at 23:09
  • That is correct, and parsers aren't required to support anything other than UTF-8. From the spec: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). Implementations MUST NOT add a byte order mark to the beginning of a JSON text." – thomasrutter Oct 19 '17 at 23:09
  • @thomasrutter The spec you quoted is old. The [current spec](https://tools.ietf.org/html/rfc8259) says: "*JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability. Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text.*" – Remy Lebeau Apr 24 '19 at 01:47
  • FWIW I wrote my above comment before that spec came out... But mandating UTF-8 as the only accepted encoding for JSON was always the only sensible way forward, so this is as it should be (and does not affect my answer). – thomasrutter Apr 24 '19 at 03:44
14

ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
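A quick way to see that rule in action is to encode a string that mixes the mandatory-escape characters with an ordinary non-ASCII one. This is only a sketch with Python's standard `json` module and an invented sample string; `ensure_ascii=False` asks the encoder to escape nothing beyond what it has to.

```python
import json

# Contains a quotation mark, a backslash, a tab (a control character) and "é".
text = 'say "hi"\\ then\ttab, é'

encoded = json.dumps(text, ensure_ascii=False)
round_tripped = json.loads(encoded)
```

In `encoded`, the quote, backslash and tab come out as `\"`, `\\` and `\t`, while "é" is left as a literal character.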

chaos
  • If you read that quote you provided you'll see that you are not required to escape all unicode characters, only a few special characters. But you are required to encode the results (preferably with utf-8). So the question is: "Why bother escaping normal unicode characters if you're utf-8 encoding". – schickb Feb 24 '09 at 21:20
  • Also, an ascii encoded string is a pure subset of utf-8. If I use json's escaping for all non-ascii characters, the result is ascii -- and therefore utf-8. Various json libraries (like python simplejson) have modes to force ascii results. I presume for a reason, like perhaps execution in browsers. – schickb Feb 24 '09 at 21:26
  • When you bother escaping normal unicode characters is in contexts where they're metacharacters, like strings. (The RFC chunk I quoted is about strings; sorry, wasn't clear about that.) You don't need to do ASCII output all the time; I'd think that's more for debugging with broken browsers. – chaos Feb 24 '09 at 21:34
7

Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.

FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.

RFC 8259 states:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
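In practice the byte order mark rule might look like the following Python sketch. The function name `parse_json_bytes` and the sample payload are illustrative only. It relies on the standard `utf-8-sig` codec, which strips a leading BOM when one is present and otherwise behaves like plain UTF-8.

```python
import codecs
import json

payload = {"city": "Zürich"}

# Sending: encode as plain UTF-8, which never adds a byte order mark.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

def parse_json_bytes(data):
    """Hypothetical receiver: tolerate an optional leading BOM, as the RFC permits."""
    return json.loads(data.decode("utf-8-sig"))

# Simulate a sender that (incorrectly) prepended a BOM.
with_bom = codecs.BOM_UTF8 + body
```

Both `parse_json_bytes(body)` and `parse_json_bytes(with_bom)` yield the same data, so the receiver is lenient while the sender stays strict.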

Community
Remy Lebeau
6

I was facing the same problem, and this worked for me (PHP):

json_encode($array, JSON_UNESCAPED_UNICODE);
Tobi Nary
Ankit Sewadik
  • It should be noted that the above is PHP, since the question is in no way PHP-specific and only talks about *web service* which also *may not* use PHP (as the older ones of our readers may still remember…) – ntninja Mar 05 '19 at 21:06
-1

I had a similar problem with the é character. I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I noticed and changed it to utf8. The problem is that the data was already there, so I'm not sure whether it was converted when I changed the collation; it displays fine in MySQL Workbench. The end result is that PHP will not JSON-encode the data; json_encode() just returns false. It doesn't matter which browser you use, as it's the server causing my issue: PHP will not process the data as UTF-8 if this character is present. As I say, I'm not sure whether that is due to converting the schema to utf8 after the data was present or just a PHP bug. In this case, use json_encode(utf8_encode($string));
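For anyone wondering why a single "é" can break encoding, here is a rough Python sketch of the same situation. It assumes the stored bytes are really Latin-1 (ISO-8859-1), which is the conversion PHP's utf8_encode() performs; if your data is in some other encoding, the decode step would need to name that encoding instead.

```python
import json

# "é" as a Latin-1 system stores it: the single byte 0xE9.
raw = "é".encode("latin-1")

# That byte on its own is not valid UTF-8, so a strict UTF-8 consumer rejects it.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Decode with the encoding the bytes are actually in, then serialize as UTF-8 JSON.
text = raw.decode("latin-1")
body = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")
```

The fix is the explicit decode from the real source encoding, not anything specific to JSON.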

Paul Smith