Inside a Twisted Resource, I am returning a JSON-encoded dict as the response (the `response` variable below). The data is a list of 5 people, each with a name, a GUID, and a couple of other fields under 32 characters long, so it is not a lot of data.

I get this OverflowError exception pretty often, but I don't quite understand what the unsupported utf-8 sequence length refers to.

self.request.write(ujson.dumps(response))

exceptions.OverflowError: Unsupported UTF-8 sequence length when encoding string

Coder1

2 Answers

Just a note that I recently encountered this same error, and can give a little background.

If you see this, it's possible you're trying to JSON-encode a MongoDB document that contains an ObjectId with ujson in Python.

Using the native python library, we get a more helpful error message:

TypeError: ObjectId('510652d322fc956ca9e41342') is not JSON serializable

ujson is somehow trying to parse an ObjectId Python object and getting lost. There are a few options, the most direct being removing the '_id' field from the Mongo document before encoding it. You could also convert the ObjectIds into simple character strings before handing the data to ujson.
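Both options can be sketched roughly like this. To keep it runnable without pymongo or ujson installed, it uses the standard `json` module and a made-up `FakeObjectId` class in place of the real ObjectId; `to_json_safe` is also just a name invented for this example, not part of any library.

```python
import json


class FakeObjectId:
    """Hypothetical stand-in for Mongo's ObjectId (an assumption for this sketch)."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id


# Types the JSON encoder can represent directly.
JSON_NATIVE = (str, int, float, bool, type(None))


def to_json_safe(value):
    """Recursively replace values JSON cannot represent with their str() form."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, JSON_NATIVE):
        return value
    return str(value)


doc = {"_id": FakeObjectId("510652d322fc956ca9e41342"), "name": "Alice"}

# Option 1: drop the '_id' field before encoding.
without_id = {k: v for k, v in doc.items() if k != "_id"}
print(json.dumps(without_id))

# Option 2: keep the id, but as a plain string.
print(json.dumps(to_json_safe(doc)))
```

The dict returned by `to_json_safe` contains only plain strings, numbers, lists and dicts, so it should be equally safe to pass to a stricter encoder.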

Peter V
  • I tried to modify `json_util` in `bson.py` (pymongo) and replaced `import json` with `import ujson as json`. It didn't work; they don't share the same methods :( – Abdelouahab Pp Apr 07 '13 at 18:37
  • You saved the day. – imichaeldotorg Apr 25 '19 at 16:20
  • This can be solved by setting the `default_handler` argument to `str`, like this: `jsonResult = df.to_json(default_handler=str)`. The issue has been discussed here: https://github.com/pandas-dev/pandas/issues/14256 and contains explanations. – Ivan Sivak Jan 14 '20 at 08:01

When in doubt, check the source: http://code.google.com/p/rapidjson/source/browse/trunk/thirdparty/ultrajson/ultrajsonenc.c

This error happens when the UTF-8 length is 5 or 6 bytes. This JSON implementation doesn't implement that. Those characters won't work if you're using the data in a browser anyway, since they're outside the range of UTF-16.

I'd be surprised if this actually happened often; it'd only happen with Unicode codepoints over U+1FFFFF, which are vanishingly rare, and not even supported in Unicode strings by most builds of Python due to being outside this range. You should find out why these characters are showing up in your data.
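To make the byte lengths concrete, here is a small sketch of how the original 1 to 6 byte UTF-8 scheme (RFC 2279) maps code points to sequence lengths. The function name `utf8_sequence_length` is made up for illustration and is not taken from ujson's source. Note that Unicode itself now ends at U+10FFFF, so Python's own encoder never produces more than 4 bytes per character.

```python
def utf8_sequence_length(code_point):
    """Bytes needed for a code point under the original 1-6 byte UTF-8 scheme."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 byte sequences.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point too large for UTF-8")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF, 0x200000):
    print(f"U+{cp:04X}: {utf8_sequence_length(cp)} byte(s)")

# The highest valid Unicode code point still fits in 4 bytes.
print(len(chr(0x10FFFF).encode("utf-8")))
```

So a 5 or 6 byte sequence can only come from a value above U+1FFFFF, which no valid Unicode string contains; seeing this error on ordinary data suggests the encoder was handed something that is not really a string.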

Glenn Maynard
  • Thanks Glenn. Still getting used to Python and thought it was a Twisted issue, didn't think to look at ujson since it was working fine with other data. The data does come into the app over a socket connection, so that is most likely the culprit. Thanks a lot. – Coder1 Dec 07 '11 at 21:24
  • I don't see why "outside the BMP" is particularly relevant to the question of whether a browser can render a glyph for a particular code point. It also seems to me like this really qualifies as a bug in the implementation; the JSON spec is quite explicit that a "char" is "any Unicode character except double-quote or backslash or a control character". – Karl Knechtel Dec 07 '11 at 22:27
  • @Karl: Just a typo; it's the range that matters: [0, 0x10FFFF]. JavaScript uses UTF-16, which can only represent code points in that range. In practice, JSON serializers that output ASCII use UTF-16 surrogates, and can only output this range; JSON has no 8-digit Unicode escape. – Glenn Maynard Dec 09 '11 at 05:03
  • The verdict: I was storing the data in MongoDB. The error came from the default _id value Mongo returns from the db. I unset that field and the errors went away. Thanks again for pointing me in the right direction. – Coder1 Dec 09 '11 at 05:38