
Python 2.7.3

I have read all the related threads around json.dumps UnicodeDecodeError, and most of them want me to understand what encoding I need. In my case I am creating a JSON document with various key values coming from various services (some p4 command lines), possibly in different encodings. I have a map something like this:

map = {"system1": some_data_from_system1, "system2", some_data_from_system2}
json.dumps(map)

This throws an "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 737: ordinal not in range(128)"

I would like to have ASCII characters dumped into a file. Occasionally a p4 checkin/JIRA might have non-ASCII chars, and it is perfectly okay to ignore them. I have tried ensure_ascii = False and it does not solve the problem. What I really want is for the encoder to simply ignore any non-ASCII chars along the way. I think this is reasonable but cannot find any way to do it.

Suggestions?

– Kannan Ekanath
    Please do include a *reproducible sample*; what are the values of `some_data_from_system1` and `some_data_from_system2` here? What is the full traceback of your exception? – Martijn Pieters May 27 '14 at 10:34
  • Sorry, how does the data matter? These could be in different encodings, and it *SHOULD* not matter for a generic logger. The purpose is to dump as much as possible into a file for human reading (to debug later) – Kannan Ekanath May 27 '14 at 10:54
  • The problem is that the code shown *does not throw that exception*; it'd throw an exception about not being able to decode UTF-8, at most, for example. – Martijn Pieters May 27 '14 at 10:56
  • Having an actual example that shows your problem with a matching exception would have given us a far clearer picture of what you are trying to do here. You've included some stabs in the dark here, but the fact that you don't care if non-ASCII codepoints are lost isn't clear, for example. – Martijn Pieters May 27 '14 at 10:59
  • Why is it not clear? It is a Python error logging system. When an event happens (could be an exception or whatever) it simply takes some inputs, usually JSON strings, from the subsystems and sends an email with the output. It is OKAY if I lose non-ASCII code points; the data might contain the description of a user request, and it does not matter! I want to dump *whatever I can* for later perusal. I still believe this is both a sensible and valid requirement. – Kannan Ekanath May 27 '14 at 11:09
  • I think where your post went wrong is to claim that the code you showed throws a specific exception. You included code, but the code is incomplete and doesn't, in fact, throw that exception with that message. – Martijn Pieters May 27 '14 at 11:10
  • You then describe how you tried `ensure_ascii = False` but merely dismissed it as "doesn't solve the problem" without describing how it failed. Together that makes for a useless code sample, incorrect error information, and no information on how the second attempt didn't work. – Martijn Pieters May 27 '14 at 11:11

3 Answers


The json.dumps() and json.dump() functions will try to decode byte strings to Unicode values when passed in, using UTF-8 by default:

>>> map = {"system1": '\x92'}
>>> json.dumps(map)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> map = {"system1": u'\x92'.encode('utf8')}
>>> json.dumps(map)
'{"system1": "\\u0092"}'

You can set the encoding keyword argument to use a different encoding for byte strings (str values).
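For example, byte 0x92 is a curly apostrophe in the Windows-1252 codepage; if that is where your data comes from (an assumption, since the question doesn't say), naming that codec makes the byte decodable:

>>> json.dumps({'system1': '\x92'}, encoding='cp1252')
'{"system1": "\\u2019"}'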

These functions do this because JSON is a standard that uses Unicode for all strings. If you feed it data that is not encoded as UTF-8, this fails, as shown above.

On Python 2 the output is a byte string too, encoded to UTF-8. It can be safely written to a file. Setting the ensure_ascii argument to False would change that and you'd get Unicode instead, which you clearly don't want.
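A quick sketch of the difference in return types:

>>> type(json.dumps({u'system1': u'caf\xe9'}))
<type 'str'>
>>> type(json.dumps({u'system1': u'caf\xe9'}, ensure_ascii=False))
<type 'unicode'>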

So you need to ensure that what you put into the json.dumps() function is consistently all in the same encoding, or is already decoded to unicode objects. If you don't care about the occasional missed codepoint, you can force a decode with the error handler set to replace or ignore:

map = {"system1": some_data_from_system1.decode('ascii', errors='ignore')}

This decodes the string forcibly, replacing any bytes that are not recognized as ASCII codepoints with a replacement character:

>>> '\x92'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> '\x92'.decode('ascii', errors='replace')
u'\ufffd'

Here a U+FFFD REPLACEMENT CHARACTER codepoint is inserted instead to represent the unknown codepoint. You could also completely ignore such bytes by using errors='ignore'.
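For completeness, here is what the ignore handler does with the same byte; the offending data is simply dropped:

>>> '\x92'.decode('ascii', errors='ignore')
u''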

– Martijn Pieters

I have used a combination of "How to get string objects instead of Unicode ones from JSON in Python?" and the answer above to do this piece of logging.

As stated above, some_data_from_system{1|2} are not strings. The question is about a general error logging system: when things go wrong, you want to dump as much information from several subsystems as possible for human inspection. The subsystems change between environments, and it is not always known what encoding they use when they return "JSONs" representing what went wrong. To this effect I have the following method, stolen from the other thread; the essence is the encode call with the errors argument set to 'ignore'.

PLEASE NOTE: This is not a very performant method (blind recursions usually are not), so it is not suitable for a typical production application. Depending upon the data, it is possible to run into infinite recursion (for example, with self-referencing structures). However, assuming you understand the disclaimers, it is okay for error logging systems.

def convert_encoding(data, encoding='ascii'):
    # Recursively walk dicts and lists; encode any unicode values to the
    # target encoding, silently dropping characters it cannot represent.
    if isinstance(data, dict):
        return dict((convert_encoding(key, encoding), convert_encoding(value, encoding))
                    for key, value in data.iteritems())
    elif isinstance(data, list):
        return [convert_encoding(element, encoding) for element in data]
    elif isinstance(data, unicode):
        return data.encode(encoding, 'ignore')
    else:
        return data

map = {"system1": some_data_from_system1, "system2", some_data_from_system2}
json.dumps(convert_encoding(map), ensure_ascii = False)

Once that is done, this generic method can be used to dump the data.
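For illustration, a quick run with some made-up nested data (hypothetical values, not from the question):

>>> convert_encoding({u'syst\u00e9m': [u'caf\xe9', 42, {u'ok': True}]})
{'systm': ['caf', 42, {'ok': True}]}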

– Kannan Ekanath

If your string in JSON format has non-ASCII characters and you need to pass it through Python's dumps method, you can do so like this:

myString = "{key: 'Brazilian Portuguese has many different characters like maçã (apple) or Bíblia (Bible)'}"  # or a map
myJSON = json.dumps(myString, encoding="latin-1")  # use utf8 if appropriate
myJSON = json.loads(myJSON)
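To see what those two lines do, here is a short interpreter session (assuming the byte string really is Latin-1 encoded; the dumps call decodes it to Unicode, and the loads call gives you the unicode value back):

>>> import json
>>> s = 'ma\xe7\xe3'  # 'maçã' as Latin-1 bytes
>>> json.dumps(s, encoding='latin-1')
'"ma\\u00e7\\u00e3"'
>>> json.loads(json.dumps(s, encoding='latin-1'))
u'ma\xe7\xe3'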
– rink.attendant.6