I have a file containing JSON produced by Clojure's data.json library. The data came from Twitter, where people seem to smile a lot. When I run:

$ cat /tmp/myfile | jq .

I get:

parse error: Invalid \uXXXX\uXXXX surrogate pair escape at line 1, column 14862268

The offending section is:

$ cut -c 14862258-14862269 /tmp/2017-02-23-2
79-7\ud83d",

So this escape sequence appears in a real JSON file, and jq can't read it. A minimal reproduction:

echo '"\ud83d"' | jq .

Fileformat.info seems to suggest that it should come in a pair:

SMILING FACE WITH OPEN MOUTH
"\uD83D\uDE03"
  1. Is this truly an invalid character to find in a JSON file? Is my JSON technically invalid?

  2. Is there a simple utility I can pipe the data through to strip out these characters before it reaches jq? Or can I make jq relax its interpretation?

peak
Joe

2 Answers

The JSON specification says:

A string is a sequence of zero or more Unicode characters [UNICODE].

In this sense, the string "\ud83d" is NOT valid JSON ("U+D83D is not a valid Unicode character"), even though it conforms to the JSON ABNF. As the standards document goes on to say, there is a discrepancy between the string specification and the ABNF:

the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable ...

So it would be fair to say that:

  1. "\uD83D" is not strictly valid JSON, even though it conforms to the ABNF;

  2. jq is within its rights here;

  3. jsonlint is wrong to accept "\uD83D".
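
The gap between the ABNF and the Unicode requirement is easy to observe in practice. As a small sketch, assuming Python 3 and its standard `json` module (neither appears in the question): that parser follows the ABNF and accepts the lone escape, but the string it returns cannot be encoded as UTF-8, which is the kind of downstream surprise the RFC warns about.

```python
import json

# A complete surrogate pair decodes to a single code point.
paired = json.loads(r'"\ud83d\ude03"')
assert paired == "\U0001F603"  # SMILING FACE WITH OPEN MOUTH

# Python's json module also accepts the lone high surrogate ...
lone = json.loads(r'"\ud83d"')
assert len(lone) == 1 and ord(lone) == 0xD83D

# ... but the resulting string is not valid Unicode text and cannot
# be encoded as UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
```

Other parsers make different choices (jq rejects the input outright), so this only illustrates one of the possible behaviours.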

“... strip out these characters”

See, e.g., "How to remove non UTF-8 characters from text file".

peak
  • I don’t think this is the only or the correct reading of the spec. Your second quote is from the section titled ‘String and character issues’; this is just guidance: it doesn’t make any additional conformance requirements about defective surrogates, it simply describes that a client *might* fail handling such strings. The required behaviour seems unspecified and the working group has rejected a request to improve the spec in [erratum 3984](https://www.rfc-editor.org/errata_search.php?rfc=7159&eid=3984). – glts Feb 27 '17 at 22:11

It's definitely valid JSON, but the code unit D83D by itself is invalid. Remember, jq isn't merely reading the JSON text, it's trying to get its value. Once jq has consumed it, it is no longer just a stream of characters stored in JSON; it's a string with a definite value.

That value is a high surrogate, which must be followed by a low surrogate to form a pair, and your input apparently lacks one. So the string encoded in the file, while valid JSON, doesn't represent a valid Unicode string, which is what jq is trying to parse it into.

You need to go through your JSON and complete the pair(s) if you want to be able to parse it using jq.
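
To make the pairing concrete, here is a minimal sketch of the standard UTF-16 arithmetic that turns a high and a low surrogate into one code point. The function name `combine_surrogates` is only illustrative, and Python is an assumption (the question does not use it). It also shows why a lone D83D has no value on its own: without the low half, ten bits of the code point are simply missing.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # The high surrogate carries the top 10 bits, the low surrogate the
    # bottom 10 bits, of (code point - 0x10000).
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE03)
print(f"U+{code_point:X}")  # U+1F603, SMILING FACE WITH OPEN MOUTH
```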


If you can at least ensure that the file is otherwise valid JSON, you could use regular expressions to scan the data for mismatched surrogates. Something like this:

\\u[Dd][89ABab][0-9A-Fa-f]{2}(?!\\u[Dd][C-Fc-f][0-9A-Fa-f]{2})
|
(?<!\\u[Dd][89ABab][0-9A-Fa-f]{2})\\u[Dd][C-Fc-f][0-9A-Fa-f]{2}

Then you could either strip them off or make a best guess at the missing surrogate.
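
As a rough sketch of the "strip them off" option, assuming Python 3 and an input small enough to read into memory: the pattern below is a variation on the one above that also skips escaped backslashes, so that the JSON text `\\ud83d` (a literal backslash followed by "ud83d") is not touched. The function name `fix_lone_surrogates` and the default replacement `\ufffd` (U+FFFD, the replacement character) are choices made for this illustration, not part of jq or of any library.

```python
import json
import re

# One left-to-right pass over the raw JSON text. Alternatives are tried
# in order at each position:
#   1. an escaped backslash, which is skipped untouched,
#   2. a well-formed high+low surrogate pair, also left untouched,
#   3. any other surrogate escape, which must therefore be unpaired.
_SURROGATES = re.compile(
    r"\\\\"
    r"|\\u[Dd][89ABab][0-9A-Fa-f]{2}\\u[Dd][C-Fc-f][0-9A-Fa-f]{2}"
    r"|(\\u[Dd][89A-Fa-f][0-9A-Fa-f]{2})"
)


def fix_lone_surrogates(json_text: str, replacement: str = r"\ufffd") -> str:
    """Replace unpaired \\uD800-\\uDFFF escapes in raw JSON text."""
    def repl(match):
        # Group 1 is set only for the third alternative (a lone surrogate).
        return replacement if match.group(1) else match.group(0)
    return _SURROGATES.sub(repl, json_text)


raw = r'{"text": "79-7\ud83d", "ok": "\ud83d\ude03"}'
fixed = fix_lone_surrogates(raw)
print(fixed)
print(json.loads(fixed) == {"text": "79-7\ufffd", "ok": "\U0001F603"})
```

Passing `replacement=""` removes the lone escapes instead of substituting U+FFFD. To use this in front of jq, the function could be wrapped in a small script that reads `sys.stdin` and writes the result to `sys.stdout`.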

glts
Jeff Mercado
  • Yes, and since the data comes from a Clojure library I think it’s worth pointing out that Clojure strings consist of UTF-16 `char`s, and so the bad data might actually come from data.json, if it doesn’t handle surrogate pairs correctly. Indeed this used to be [a bug in data.json](http://dev.clojure.org/jira/browse/DJSON-3) but it’s been fixed long ago. – glts Feb 25 '17 at 18:30
  • Thanks for the pointers. I think I know what's going on. I'm splitting Java strings character-wise at some point, and I might be orphaning chars at that point. – Joe Feb 25 '17 at 19:00
  • Thank you! I should point out that Java has `Character.isSurrogate`, which I'm now using. – Joe Feb 27 '17 at 21:05
  • @JeffMercado - To avoid confusion, perhaps you could reword the first sentence in your response, to make it clear you're referring to the syntax (ABNF), not the semantics specified by the JSON standard. – peak Feb 27 '17 at 22:38