The json.dumps() and json.dump() functions will try to decode byte strings to Unicode values when passed in, using UTF-8 by default:
>>> map = {"system1": '\x92'}
>>> json.dumps(map)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> map = {"system1": u'\x92'.encode('utf8')}
>>> json.dumps(map)
'{"system1": "\\u0092"}'
You can set the encoding keyword argument to use a different encoding for byte string (str) characters.
These functions do this because JSON is a standard that uses Unicode for all strings. If you feed it data that is not encoded as UTF-8, this fails, as shown above.
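The encoding keyword mentioned above belongs to Python 2's json module. As a minimal sketch of what it amounts to, the decode can be written out explicitly, which also runs on Python 3. The choice of cp1252 (Windows-1252) is only an assumption about where a byte like '\x92' might have come from; substitute whatever encoding your source system really uses.

```python
import json

# Assumption: the byte string came from a system that uses Windows-1252
# (cp1252), where byte 0x92 is a right single quotation mark (U+2019).
raw = b'\x92'

# Decode explicitly instead of relying on json's UTF-8 default.
text = raw.decode('cp1252')

encoded = json.dumps({"system1": text})
print(encoded)  # {"system1": "\u2019"}
```

On Python 2, passing encoding='cp1252' to json.dumps() should give the same result without the separate decode step.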
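On Python 2 the output is a byte string too. With the default ensure_ascii=True it contains only ASCII characters (non-ASCII codepoints are written as \uXXXX escapes), so it is also valid UTF-8 and can be safely written to a file. Setting the ensure_ascii argument to False would change that and you may get a unicode object instead, which you clearly don't want.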
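To make the effect of ensure_ascii concrete, here is a small sketch that runs on Python 2.7 and on Python 3. The return types differ between the two (on Python 3 both calls return str), so the comments describe only the content of the results. The sample value is made up for illustration.

```python
import json

# Illustrative value containing one non-ASCII codepoint (U+2019).
data = {"system1": u"\u2019"}

ascii_only = json.dumps(data)                    # default: ensure_ascii=True
with_raw = json.dumps(data, ensure_ascii=False)  # keeps the character as-is

print(repr(ascii_only))
print(all(ord(ch) < 128 for ch in ascii_only))  # True: escapes only
print(u"\u2019" in with_raw)                    # True: raw character kept
```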
So you need to ensure that what you put into the json.dumps() function is consistently all in the same encoding, or is already decoded to unicode objects. If you don't care about the occasional missed codepoint, you can force a decode with the error handler set to replace or ignore:
map = {"system1": some_data_from_system1.decode('ascii', errors='ignore')}
This decodes the string forcibly instead of raising an exception. With errors='ignore', bytes that are not valid ASCII are dropped; with errors='replace', they are replaced with a replacement character:
>>> '\x92'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> '\x92'.decode('ascii', errors='replace')
u'\ufffd'
Here a U+FFFD REPLACEMENT CHARACTER codepoint is inserted to represent the byte that could not be decoded. You could also drop such bytes completely by using errors='ignore'.
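If the data is nested, the same forced decode can be applied throughout before calling json.dumps(). The following is only a sketch: decode_bytes is a hypothetical helper name, the payload is invented, and the 'ascii' / 'replace' defaults simply mirror the example above. It is written to run on both Python 2.7 and Python 3.

```python
import json


def decode_bytes(value, encoding='ascii', errors='replace'):
    """Recursively decode byte strings inside dicts, lists and tuples."""
    if isinstance(value, bytes):
        # On Python 2, bytes is str, so plain string literals land here too.
        return value.decode(encoding, errors)
    if isinstance(value, dict):
        return {decode_bytes(k, encoding, errors): decode_bytes(v, encoding, errors)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [decode_bytes(item, encoding, errors) for item in value]
    # Already text, a number, None, etc.: leave untouched.
    return value


# Hypothetical payload mixing an undecodable byte with clean values.
payload = {"system1": b'\x92', "system2": [b'ok', u'already text']}

result = json.dumps(decode_bytes(payload), sort_keys=True)
print(result)  # {"system1": "\ufffd", "system2": ["ok", "already text"]}
```

Passing errors='ignore' instead drops the offending bytes, so b'\x92' becomes an empty string.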