Java: detect control characters which are not correct for JSON

Question

I am reinventing the wheel and creating my own JSON parse methods in Java.

I am going by the (very nice!) documentation on json.org. The only part I am unsure about is where it says "or control character"

Since the documentation is so clear, and JSON is so simple and easy to implement, I thought I would go ahead and follow the spec strictly instead of being loose.

How would I correctly strip out control characters in Java? Perhaps there is a unicode range?


Edit: A (commonly?) missing piece to the puzzle

I have been informed that there are other control characters outside of the defined range ¹ ² that can be troublesome in <script> tags.

Most notably the characters U+2028 and U+2029, Line and Paragraph Separator, which act as newlines. Injecting a newline into the middle of a string literal will most likely cause a syntax error (unterminated string literal). ³

Though I believe this does not pose an XSS threat, it is still a good idea to add extra rules for use in <script> tags.

Keep it simple and encode all non-"ASCII printable" characters with \u notation. Those characters are uncommon to begin with. If you like, you can add to the white-list, but I do recommend a white-list approach.
In case you are not aware, do not forget about </script (not case sensitive), which can inject script into your page via a sequence such as </script><script src=http://tinyurl.com/abcdef>. None of those characters are encoded by JSON by default.
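To illustrate the two rules above, here is a minimal sketch of a string escaper that white-lists ASCII printable characters and writes everything else in \u notation. The class and method names (`ScriptSafeJson`, `quote`) are made up for this example, and escaping every `/` as `\/` is just one simple way to break up `</script`; it is an assumption of this sketch, not something the rules above prescribe.

```java
final class ScriptSafeJson {
    private ScriptSafeJson() {}

    /**
     * Writes s as a JSON string literal using only ASCII printable characters,
     * so the result can also be embedded inside a script element.
     */
    static String quote(String s) {
        StringBuilder out = new StringBuilder(s.length() + 2);
        out.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':
                    out.append("\\\"");
                    break;
                case '\\':
                    out.append("\\\\");
                    break;
                case '/':
                    // An escaped solidus is valid JSON and breaks up a closing script tag.
                    out.append("\\/");
                    break;
                default:
                    if (c >= 0x20 && c <= 0x7E) {
                        // White-list: ASCII printable characters pass through unchanged.
                        out.append(c);
                    } else {
                        // Everything else becomes a four-digit hexadecimal escape.
                        // Supplementary characters come out as two surrogate escapes,
                        // which JSON allows.
                        out.append(String.format("\\u%04x", (int) c));
                    }
            }
        }
        out.append('"');
        return out.toString();
    }
}
```

With this sketch, `ScriptSafeJson.quote("</script>")` yields `"<\/script>"`, and a string containing U+2028 comes out with a literal `\u2028` escape instead of the raw separator.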

Unicode is Unicode. UTF-16 is an encoding. I think Java has tests for Unicode groupings? See [the Character class documentation](http://download.oracle.com/javase/6/docs/api/java/lang/Character.html) for some preamble stuff and other interesting functions. — , May 18 '11 at 22:00
What I mean is, every character in a Java string is two bytes. Even if the data is ASCII, when converted to a string, it ends up two bytes per character. — 700 Software, May 18 '11 at 22:01
*"For those who don't know, Java operates with UTF-16 characters."* Well, yes, sort of. Java's `String` type stores string data internally in UTF-16, but Java is perfectly happy reading and writing using other encodings (including UTF-8 or Windows-1252 -- both commonly used -- and UTF-32). May be worth starting here: http://www.joelonsoftware.com/articles/Unicode.html — T.J. Crowder, May 18 '11 at 22:03
Don't worry, I understand string encoding even if I am not describing it right. — 700 Software, May 18 '11 at 22:04

score 9 · Accepted Answer · answered May 18 '11 at 22:04


Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...

answered May 18 '11 at 22:04

Dilum Ranatunga


I am operating at the character level. Bytes are converted to string before the JSON parse begins. – 700 Software May 18 '11 at 22:06
I don't know if `isISOControl` is correct or not. I know it will do because this does not need to be strictly correct. :) – 700 Software May 18 '11 at 22:18

@George: Well, the docs say *"A character is considered to be an ISO control character if its code is in the range `'\u0000'` through `'\u001F'` or in the range `'\u007F'` through `'\u009F'`"* As that matches the definition I linked to of a Unicode control character, I'd say @Dilum is on a winner... :-) (Though being the pedant I am, I'd probably want to find a reference saying that the two really were linked, so that if one changes, I don't have to worry about them getting out of sync.) But that's probably pedantry. – T.J. Crowder May 18 '11 at 22:33
@T.J.: +1 to you and jarnbjo. Accepting Dilum's answer because that is what I ended up using. – 700 Software May 18 '11 at 22:39
@George: Entirely reasonable! :-) – T.J. Crowder May 19 '11 at 05:56

score 6 · Answer 2 · answered May 18 '11 at 22:04

Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.

In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.
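As a rough sketch of that check, wrapped in a helper: the class and method names below are invented for illustration. The second method compares the result with `Character.isISOControl` for every `char` value. The two should agree, since the Unicode control category (Cc) covers the same ranges that the `isISOControl` documentation lists, but that is worth verifying on your own JDK rather than taking on trust.

```java
final class ControlCategoryCheck {
    private ControlCategoryCheck() {}

    /** True if c belongs to the Unicode general category Cc (control). */
    static boolean isUnicodeControl(char c) {
        return Character.getType(c) == Character.CONTROL;
    }

    /**
     * Counts the char values for which the Unicode category test and
     * Character.isISOControl give different answers.
     */
    static int countDisagreements() {
        int disagreements = 0;
        // Loop with an int so the counter can pass Character.MAX_VALUE without wrapping.
        for (int cp = Character.MIN_VALUE; cp <= Character.MAX_VALUE; cp++) {
            char c = (char) cp;
            if (isUnicodeControl(c) != Character.isISOControl(c)) {
                disagreements++;
            }
        }
        return disagreements;
    }
}
```

Note that neither test flags U+2028 or U+2029: those are in the line and paragraph separator categories, not the control category.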

score 5 · Answer 3 · edited Oct 07 '21 at 08:57

I know the question was asked a couple of years ago, but I am replying anyway because the accepted answer is not correct.

Character.isISOControl(int codePoint)

does the following check:

(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);

RFC 7159, the JSON specification (https://www.rfc-editor.org/rfc/rfc7159), defines strings as follows:

Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Character.isISOControl(int codePoint)

will flag all characters that need to be escaped (U+0000-U+001F), though it will also flag characters that do not need to be escaped (U+007F-U+009F). It is not required to escape the characters (U+007F-U+009F).
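If you want the check to match the RFC text quoted above exactly, a plain range test is enough. The following is only a sketch; `JsonStringChars`, `mustEscape` and `indexOfRawControl` are names made up for this example. A strict parser could use the second method to reject raw control characters inside a string while still accepting U+007F through U+009F.

```java
final class JsonStringChars {
    private JsonStringChars() {}

    /**
     * True if c may not appear unescaped inside a JSON string:
     * quotation mark, reverse solidus, or a control character U+0000..U+001F.
     */
    static boolean mustEscape(char c) {
        return c == '"' || c == '\\' || c <= 0x1F;
    }

    /**
     * Returns the index of the first raw control character (U+0000..U+001F)
     * in s, or -1 if there is none.
     */
    static int indexOfRawControl(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) <= 0x1F) {
                return i;
            }
        }
        return -1;
    }
}
```

For example, `mustEscape((char) 0x7F)` is false here, whereas `Character.isISOControl((char) 0x7F)` is true.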

score 4 · Answer 4 · answered May 18 '11 at 22:06


I believe the Unicode definition of a control character is:

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.

That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...

answered May 18 '11 at 22:06

T.J. Crowder

