Encoding JSON in UTF-16 or UTF-32

Question

The JSON RFC, section 2.5, says in part:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

Assume I have a valid reason to encode JSON as UTF-16BE (which is allowed). When doing so, is it still necessary to escape characters that are not in the Basic Multilingual Plane? E.g., instead of this:

00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
  \     u     D     8     3     4     \     u     D     D     1     E

which is the 24-byte UTF-16BE byte sequence for \uD834\uDD1E, is it legal to do this:

D8 34 DD 1E

i.e., use the 4-byte UTF-16BE values directly?

Similarly, if I were to encode the same JSON string as UTF-32BE, could I simply use the code-point value directly:

00 01 D1 1E

?
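The three byte sequences above can be reproduced with Python's built-in codecs. This is only a check of the encodings themselves, not of what any particular JSON parser accepts; the variable names are illustrative.

```python
# Reproduce the byte sequences from the question with Python's built-in codecs.
clef = "\U0001D11E"          # MUSICAL SYMBOL G CLEF, outside the BMP
escape_text = "\\uD834\\uDD1E"  # the twelve-character escape, as literal text

escaped_utf16 = escape_text.encode("utf-16-be")  # escape sequence in UTF-16BE
raw_utf16 = clef.encode("utf-16-be")             # surrogate pair, written directly
raw_utf32 = clef.encode("utf-32-be")             # code point, written directly

# bytes.hex(sep) requires Python 3.8 or later.
print(len(escaped_utf16), escaped_utf16.hex(" ").upper())
print(len(raw_utf16), raw_utf16.hex(" ").upper())   # 4 D8 34 DD 1E
print(len(raw_utf32), raw_utf32.hex(" ").upper())   # 4 00 01 D1 1E
```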

Good question. I suspect that whatever the spec says, in the end it comes down to the support of whoever is parsing the JSON. — deceze, Jul 25 '12 at 06:01

score 19 · Accepted Answer · edited Jul 19 '16 at 19:47

As far as I can tell, yes, you can write the UTF-16 values directly. My reasoning: the RFC paragraph you quoted explains how to escape a character outside the Basic Multilingual Plane if you have decided to escape it. However, earlier in that same section, the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence...

(Emphasis added.)

To me, this says that only ", \ and control characters must be escaped, and that any other Unicode characters may be placed as-is directly into the JSON text (in whatever UTF form you are using). It also says to me that even if you're encoding as UTF-8, you don't need to use the \uXXXX form for any Unicode character other than ", \, and control characters.
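A minimal sketch of this reading, using Python's standard `json` module as one example implementation (other parsers may behave differently). With `ensure_ascii=False` the serializer escapes only the quotation mark, the backslash, and control characters, and leaves the G clef as-is; with the default `ensure_ascii=True` it emits the surrogate-pair escape instead. Both forms decode to the same string. The sample text is made up for illustration.

```python
import json

# Sample string: a non-BMP character plus the characters that must be escaped.
text = 'G clef: \U0001D11E, quote: ", backslash: \\, tab: \t'

# Only ", \ and control characters are escaped; the clef is written directly.
raw = json.dumps(text, ensure_ascii=False)

# Every non-ASCII character is escaped; the clef becomes a surrogate-pair escape.
escaped = json.dumps(text, ensure_ascii=True)

# Encode the unescaped JSON text as UTF-16BE, then decode and parse it again.
utf16_bytes = raw.encode("utf-16-be")
round_tripped = json.loads(utf16_bytes.decode("utf-16-be"))

print(round_tripped == text)
print(json.loads(escaped) == text)
```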

(As an aside, this does make me wonder whether the \uXXXX form is actually useful for anything other than control characters. As the other poster said, it probably comes down to what your JSON parser actually supports.)

edited Jul 19 '16 at 19:47

Remy Lebeau

answered Jul 25 '12 at 16:25

Chris Hillery

+1. `\u` form has more use for JSONP than straight JSON, since (a) you can't be sure what `charset` the containing page is using and setting `charset` in the `Content-Type` of a ` – bobince Jul 25 '12 at 22:01
@bobince JSON is not only for communicating with JavaScript; it is often used to communicate between systems implemented in other languages, so replicating all the limitations of JavaScript is counter-productive. Is U+2028 really forbidden in JavaScript strings? I can see that it would be harmful in source code between identifiers, but in string literals it should be harmless. – Jasen May 04 '18 at 03:07
Yes, U+2028 (and U+2029) are forbidden in JavaScript string literals; they're defined to be equivalent to a newline. If you're writing a JSON output library, IMO it makes sense to replicate the limitations of popular languages to maximise interoperability. – bobince May 09 '18 at 11:08
@Jasen I mean... it is Java Script Object Notation. If it can't be used as-is for javascript objects, well, you've got a naming problem at the very least. Fundamentally, it's a POLA issue. – DylanYoung Jul 16 '18 at 20:05
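
The interoperability point in the comments above could be sketched as follows: a JSON writer that otherwise leaves Unicode unescaped but still uses the `\u` form for U+2028 and U+2029, for consumers that treat them as line terminators inside string literals. `dumps_js_safe` is a hypothetical helper, not part of any library.

```python
import json

def dumps_js_safe(value):
    """Hypothetical helper: serialize to JSON, escaping U+2028 and U+2029."""
    out = json.dumps(value, ensure_ascii=False)
    # In json.dumps output these characters can only occur inside strings,
    # so a plain textual replacement keeps the result valid JSON.
    return out.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")

sample = "line one\u2028line two"
encoded = dumps_js_safe(sample)
print(encoded)
print(json.loads(encoded) == sample)
```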

score -4 · Answer 2 · edited May 13 '20 at 14:11

One approach we explored, and which worked in Azure Data Factory: set the encoding to US-ASCII on the sink side (the JSON file). The source remains the same REST API link.

edited May 13 '20 at 14:11

António Ribeiro

answered May 13 '20 at 10:41

krishna Bharadwaj

Hi, I am reviewing your post. Although this is a good answer, it is always helpful for the reader if you add a few lines of code. It is also better to upload an image directly into the post, since link addresses may change over time. – rainer May 13 '20 at 12:08
