I have a string like this coming in as part of JSON data: `call\\U007fabc computers`. When I try to parse it, Jackson throws an exception like this:

org.codehaus.jackson.JsonParseException: Unrecognized character escape 'U' (code 85)
 at [Source: java.io.StringReader@1b43c429; line: 1, column: 361]
        at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1292)
        at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
        at org.codehaus.jackson.impl.JsonParserMinimalBase._handleUnrecognizedCharacterEscape(JsonParserMinimalBase.java:360)
        at org.codehaus.jackson.impl.ReaderBasedParser._decodeEscaped(ReaderBasedParser.java:1064)
        at org.codehaus.jackson.impl.ReaderBasedParser._finishString2(ReaderBasedParser.java:785)
        at org.codehaus.jackson.impl.ReaderBasedParser._finishString(ReaderBasedParser.java:762)

I think the problem is caused by `\\U007f`. It definitely means something in UTF-8. Any idea how we can avoid this issue? Will JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER help here?

john
  • Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. (You need to show us the actual JSON text, not simply a few characters from it.) – Hot Licks Jun 25 '15 at 01:00

2 Answers


Your JSON data is malformed.

JSON uses the \u escape sequence to encode a UTF-16 code unit.

In this case, your JSON data is trying to escape Unicode code point U+007F DELETE (an ASCII control character that the JSON spec does not require to be escaped, but allows to be escaped), and it is using the \U escape sequence to do so. The JSON spec explicitly states that a lowercase \u MUST be used:

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

...

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point.

...

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.

Although not explicitly stated in that last paragraph, the twelve-character sequence for a UTF-16 surrogate pair consists of two six-character sequences that must follow the same escape format as characters in the BMP. This is enforced by the character encoding diagram:

diagram
(source: json.org)

There is no \U escape sequence defined. That is what the parser error message is complaining about:

Unrecognized character escape 'U'
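If the producer of the data cannot be fixed, one possible workaround is to rewrite the invalid `\U` escapes to `\u` before handing the text to the parser. The following is only a minimal sketch under assumptions: the class name `EscapeFixer` and the method `normalize` are hypothetical (they are not part of Jackson), and the sketch assumes every `\U` in the input is followed by exactly four hex digits and was meant as `\u`. It leaves an escaped backslash followed by a literal `U` (that is, `\\U` in the raw JSON text) untouched, because that is already valid JSON.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jackson: rewrites invalid \UXXXX escapes to \uXXXX.
final class EscapeFixer {

    // (?<!\\)      the run of backslashes must start here (no backslash before it)
    // ((?:\\\\)*)  any number of already-escaped backslash pairs, kept as group 1
    // \\U          a single unescaped backslash followed by an uppercase U
    // ([0-9A-Fa-f]{4})  the four hex digits, kept as group 2
    private static final Pattern UPPER_U_ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)\\\\U([0-9A-Fa-f]{4})");

    private EscapeFixer() {
    }

    // Returns the JSON text with each \UXXXX escape replaced by \uXXXX.
    static String normalize(String rawJson) {
        return UPPER_U_ESCAPE.matcher(rawJson).replaceAll("$1\\\\u$2");
    }
}
```

The result of `EscapeFixer.normalize(rawJson)` can then be passed to the parser in place of the original text. Fixing the system that generates the data is still the cleaner solution, since this kind of pre-processing guesses at the sender's intent.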

Remy Lebeau
  • So basically it means we are getting data like this `\\U007` somehow which is not valid? Or after encoding it is becoming like this? – john Jun 25 '15 at 02:02
  • Presumably your JSON does not actually contain `\\U` but just `\U`, or else it would not be processed as an escape sequence to begin with. Are you looking at the JSON data in a debugger or something? But yes, `\U` (uppercase U) is not valid; it must be `\u` (lowercase u) instead: `\u007f` (the `f` is part of the encoded sequence) – Remy Lebeau Jun 25 '15 at 02:13

You are probably running into the Unicode character U+007F DELETE.

This answer states that it shouldn't have been encoded.

However, to work around the problem, you can refer to this answer on how to strip such characters.
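As a rough illustration of the stripping approach, here is a minimal sketch that removes control characters such as U+007F from an already decoded string value. The class name `ControlCharStripper` and the method `stripControlChars` are hypothetical, and the choice to keep tab, line feed and carriage return is an assumption made for this example, not something taken from the linked answer.

```java
// Hypothetical helper: drops ISO control characters from a decoded string value.
final class ControlCharStripper {

    private ControlCharStripper() {
    }

    // Removes characters in U+0000..U+001F and U+007F..U+009F,
    // except tab, line feed and carriage return (kept by assumption).
    static String stripControlChars(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean isAllowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (isAllowedWhitespace || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that this operates on the value after parsing. It does not help with the parse error itself, which is caused by the invalid `\U` escape in the raw JSON text.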

suvartheec
  • Thanks for the suggestion. So you are saying this shouldn't have been encoded at all, right? If it hadn't been encoded, what would it look like? – john Jun 24 '15 at 01:55