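Unicode code point U+D800 is a high surrogate: it may only occur as the first half of a surrogate pair, and surrogate pairs only exist in the UTF-16 encoding. So the string inside the JSON, once its escape is decoded, contains a lone surrogate and cannot be encoded as valid UTF-8.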
The JSON itself might or might not be valid. The spec doesn't mention the case of unpaired surrogates, but it does explicitly allow unassigned code points:
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
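As an illustration of that "semantic decision", here is a minimal sketch of what Python's built-in json module does with the G clef example from the quote. The variable names are only for illustration; the behaviour shown is that of the standard json module, which combines the escaped pair into one code point.

```python
import json

# JSON text containing the escaped surrogate pair for U+1D11E (G clef).
text = '"\\uD834\\uDD1E"'
clef = json.loads(text)

# The two \u escapes come back as a single code point, not two surrogates.
print(len(clef))       # 1
print(hex(ord(clef)))  # 0x1d11e
```

Other JSON processors are free to keep the two surrogates separate, which is exactly the latitude the spec describes.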
Now, you can choose your friends, but you can't choose your family and you can't always choose your JSON either. So the next question is: how to parse this mess?
It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) have no problems parsing the JSON. The problem only occurs once you try to use the string. So this really doesn't have anything to do with JSON at all:
>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
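To tie this back to JSON, a small sketch showing that parsing succeeds and only the later encoding step fails. The document and its "name" key are made up for the example.

```python
import json

# Hypothetical JSON document with a lone high surrogate in a string value.
doc = json.loads('{"name": "\\ud800"}')  # parses without complaint
value = doc["name"]

# The failure only shows up when the string has to become bytes.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(len(value), encodable)  # 1 False
```

print() fails for the same reason: it has to encode the string for the terminal.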
Fortunately, we can encode the string manually and tell Python how to handle the error. For example, replace the erroneous code point with a question mark:
>>> bork.encode('utf-8', errors='replace')
b'?'
The documentation lists other possible options for the errors argument.
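For a quick comparison, this sketch runs the same lone surrogate through several of the standard error handlers. Which one is appropriate depends on what the consumer of the data expects; the selection here is just a sample.

```python
bork = '\ud800'

# A few of the standard codec error handlers, applied to the same string.
handlers = ('ignore', 'replace', 'backslashreplace',
            'xmlcharrefreplace', 'surrogatepass')
results = {h: bork.encode('utf-8', errors=h) for h in handlers}

for name, encoded in results.items():
    print(f'{name:18} {encoded!r}')

# ignore             b''
# replace            b'?'
# backslashreplace   b'\\ud800'
# xmlcharrefreplace  b'&#55296;'
# surrogatepass      b'\xed\xa0\x80'
```

Note that the bytes produced by surrogatepass are not valid UTF-8 in the strict sense; they only decode again if the same handler is passed to decode().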
To fix up this broken string, we can encode (into bytes) and then decode (back into str):
>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
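If the bad string can be anywhere in the parsed document, the same round trip can be applied recursively. This is only a sketch: scrub_surrogates is a hypothetical helper name, the sample document is invented, and it assumes the data consists of the plain dict, list, and str types that json.loads produces.

```python
import json

def scrub_surrogates(value):
    """Recursively replace unencodable code points in parsed JSON data."""
    if isinstance(value, str):
        # Same encode/decode round trip as above.
        return value.encode('utf-8', errors='replace').decode('utf-8')
    if isinstance(value, list):
        return [scrub_surrogates(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so they get the same treatment.
        return {scrub_surrogates(k): scrub_surrogates(v)
                for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

data = json.loads('{"ok": "fine", "bad": ["a\\ud800b"]}')
clean = scrub_surrogates(data)
print(clean)  # {'ok': 'fine', 'bad': ['a?b']}
```

After this, every string in clean can be printed or encoded as UTF-8 without raising.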