ValueError: Unpaired high surrogate when decoding 'string' on reading json file

Question

I am currently working with Python 3.8.6. I get the following error when reading (thousands of) JSON files in Python:

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I tried the following solutions from other Stack Overflow posts, but nothing worked:

1) import json
   json.loads('{"":"\\ud800"}')

2) import simplejson
   simplejson.loads('{"":"\\ud800"}')

The problem is that after getting this error the remaining json files are not read. Is there a way to get rid of this error so I can read all the json files?

I am not sure what information is needed to diagnose the problem, so please feel free to ask.

@tripleee thank you for looking into it. the problem is that after getting this error the remaining json files are not read. Is there a way to get rid of this error so I can read all the json files? — Samiksha, Feb 02 '21 at 14:01
If you can guess what the missing data is, add it back. My answer has some more details around this. — tripleee, Feb 02 '21 at 14:11
The examples you provide do not raise an exception. It is some other part of your code that is encountering a problem with the data. — John, Feb 02 '21 at 14:14

score 2 · Answer 1 · answered Feb 02 '21 at 14:09

Unicode code point U+D800 may only occur as part of a surrogate pair (and then only in UTF-16 encoding). So the string inside the JSON, once its escape sequence is decoded, cannot be encoded as valid UTF-8.

The JSON itself might or might not be valid. The spec doesn't mention the case of unpaired surrogates, but does explicitly allow code points that Unicode has not assigned:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.

Now, you can choose your friends, but you can't choose your family and you can't always choose your JSON either. So the next question is: how to parse this mess?

It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) have no problems parsing the JSON. The problem only occurs once you try to use the string. So this really doesn't have anything to do with JSON at all:

>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
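
To see which strings are affected before anything tries to print or write them, one option is to scan for code points in the surrogate range (U+D800 to U+DFFF). This is only a sketch; the helper name `surrogate_positions` is made up for illustration and is not part of any library:

```python
def surrogate_positions(text):
    """Return the indexes of all surrogate code points in a str.

    A Python str that contains any code point in U+D800..U+DFFF
    cannot be encoded as UTF-8 with the default 'strict' handler.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


bork = 'ab\ud800cd'
positions = surrogate_positions(bork)
print(positions)  # [2]
```

An empty list means the string can be encoded safely.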

Fortunately, we can encode the string manually and tell Python how to handle the error. For example, replace the erroneous code point with a question mark:

>>> bork.encode('utf-8', errors='replace')
b'?'

The documentation lists other possible options for the errors argument.
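
As a rough comparison, here is what a few of the standard error handlers produce for the same broken string. Which one is appropriate depends on whether you would rather drop the bad code point, mark it, or keep a readable trace of it:

```python
bork = 'a\ud800b'  # one lone high surrogate between two ASCII letters

results = {
    handler: bork.encode('utf-8', errors=handler)
    for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
}

for handler, encoded in results.items():
    print(f'{handler:18} {encoded!r}')

# ignore             b'ab'
# replace            b'a?b'
# backslashreplace   b'a\\ud800b'
# xmlcharrefreplace  b'a&#55296;b'
```

`backslashreplace` is handy when you want to see afterwards exactly which code point was broken.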

To fix up this broken string, we can encode (into bytes) and then decode (back into str):

>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
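
Applied to the original problem (thousands of files, and one failure stopping the rest), the same encode/decode trick can be run over every string in the parsed data, with each file wrapped in its own try/except. This is a sketch under assumptions: the names `scrub_surrogates` and `load_all` are hypothetical, and it uses the built-in json module, which may not be the parser that raised the error in the question.

```python
import json
from pathlib import Path


def scrub_surrogates(obj):
    """Recursively replace surrogate code points in every string of a parsed JSON value."""
    if isinstance(obj, str):
        # errors='replace' turns each unencodable code point into '?'.
        return obj.encode('utf-8', errors='replace').decode('utf-8')
    if isinstance(obj, list):
        return [scrub_surrogates(item) for item in obj]
    if isinstance(obj, dict):
        return {scrub_surrogates(k): scrub_surrogates(v) for k, v in obj.items()}
    return obj  # numbers, booleans and None pass through unchanged


def load_all(paths):
    """Parse each file; record failures instead of stopping at the first one."""
    loaded, failed = {}, {}
    for path in paths:
        try:
            text = Path(path).read_text(encoding='utf-8')
            loaded[str(path)] = scrub_surrogates(json.loads(text))
        except ValueError as exc:
            # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            failed[str(path)] = exc
    return loaded, failed
```

After the loop, `failed` maps each problem file to its exception, so the bad files can be inspected separately while the rest of the data is already loaded.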

tripleee · Answer 2 · 2021-02-02T14:13:40.287

A Unicode surrogate in isolation does not correspond to anything. Every valid high surrogate code point needs to be immediately followed by a low surrogate code point before it can be meaningfully decoded.

The error message simply means that this code point in isolation does not have a well-defined meaning. It's like saying "take" without saying what we should take, or "look at" without the object of the sentence filled in.

You should not be using surrogates in files which do not contain UTF-16 anyway; they are reserved strictly for that encoding. UTF-16 uses them to represent characters outside the 16-bit range it can express directly, by splitting each such character across two 16-bit code units.

The simple and obvious fix is to supply the missing information, but we can't know what it is. Perhaps you have more context and can fill in the correct low surrogate. But for example, this works:

>>> import json
>>> json.loads('{"":"\\ud800\\udc00"}')
{'': '𐀀'}

It populates the JSON with the single code point U+10000, but of course we can have no idea whether that's actually the code point your data should contain.
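
For reference, the arithmetic that turns a surrogate pair into one code point is short enough to write out. The function names below are made up for illustration; the constants come from the UTF-16 definition:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid high/low surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e (the G clef)
```

Each surrogate carries 10 bits of the result, which is why a high surrogate on its own leaves the code point undetermined.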

Strangely enough, inside JSON, surrogate pairs may be used to encode code points outside the BMP (see the last paragraphs of [the spec](https://www.ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf)). — Thomas, Feb 02 '21 at 14:02
@Thomas Thanks for pointing that out; added a brief sentence about this. I'll repeat again that this is not a valid use of surrogate pairs; JSON should normally not use this (though there are [legacy reasons](https://stackoverflow.com/questions/38463038/why-does-json-encode-utf-16-surrogate-pairs-instead-of-unicode-code-points-direc)). — tripleee, Feb 02 '21 at 14:07
