
Update:

I found the answer here: Python UnicodeDecodeError - Am I misunderstanding encode?

I needed to explicitly decode my incoming file into Unicode when I read it, because it contained bytes that were not valid ASCII and had never been decoded to Unicode. So the encode was failing when it hit those characters.
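For concreteness, a minimal sketch of that fix in Python 2.7, assuming the incoming file is UTF-8 (the filename and codec here are placeholders; substitute whatever your source actually uses):

    import io
    import json

    # io.open in Python 2.7 decodes for you and returns unicode objects,
    # so nothing downstream has to guess at an encoding.
    with io.open('incoming.txt', 'r', encoding='utf-8') as f:
        text = f.read()  # a unicode object, not a byte str

    # Equivalent manual form: open('incoming.txt', 'rb').read().decode('utf-8')

    dumped = json.dumps([text])  # default ensure_ascii=True gives a safe ascii str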

Original Question

So, I know there's something I'm just not getting here.

I have an array of unicode strings, some of which contain non-ASCII characters.

I want to encode that as json with

json.dumps(myList)

It throws an error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 13: ordinal not in range(128)
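(For context: this is easy to reproduce in Python 2.7 whenever the list actually contains a byte string with a non-ASCII byte. The snippet below is my illustration, not the original data:)

    import json

    # The u'...' entry is fine; the plain str is the problem: json.dumps
    # implicitly decodes it with the default 'ascii' codec, and that fails.
    myList = [u'fine', 'caf\xb4']
    json.dumps(myList)
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 3 ...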

How am I supposed to do this? I've tried setting the ensure_ascii parameter to both True and False, but neither fixes this problem.

I know I'm passing unicode strings to json.dumps. I understand that a json string is meant to be unicode. Why isn't it just sorting this out for me?

What am I doing wrong?

Update: Don Question sensibly suggests I provide a stack trace. Here it is:

Traceback (most recent call last):
  File "importFiles.py", line 69, in <module>
    x = u"%s" % conv
  File "importFiles.py", line 62, in __str__
    return self.page.__str__()
  File "importFiles.py", line 37, in __str__
    return json.dumps(self.page(),ensure_ascii=False)
  File "/usr/lib/python2.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 204, in encode
    return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 17: ordinal not in range(128)

Note it's Python 2.7, and the error still occurs with ensure_ascii=False.

Update 2: Andrew Walker's useful link (in the comments) leads me to think I can coerce my data into a convenient byte format before trying to json-encode it, by doing something like:

data.encode("ascii","ignore")

Unfortunately that is throwing the same error.
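The reason is worth spelling out (my explanation, not from the original post): in Python 2, calling .encode() on a byte str first implicitly decodes it with the default ascii codec, so the same UnicodeDecodeError fires before the encode ever runs. The working order is decode first, then dump; a sketch, assuming the incoming bytes are latin-1 (swap in the real codec):

    import json

    data = 'caf\xb4'                 # byte str containing 0xb4

    # data.encode('ascii', 'ignore')  # fails: implicit data.decode('ascii')

    text = data.decode('latin-1')    # bytes -> unicode; codec is an assumption
    print json.dumps([text])         # works: ["caf\u00b4"]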

interstar
  • You may also want to check out Ned Batchelder's slides from PyCon 2012 to get a deeper understanding of unicode, and how it relates to Python 2 and 3. http://nedbatchelder.com/text/unipain.html – Andrew Walker Mar 13 '12 at 23:32
  • Also I'm asking about output because output context matters. If your output context isn't Unicode it will have to convert to whatever it accepts. In this case, ASCII. I get this issue when I try to dump Unicode to the command line – tkone Mar 13 '12 at 23:33
  • tkone I'm just trying to write a simple command line script to convert some text files into pages for Smallest Federated Wiki ( https://github.com/WardCunningham/Smallest-Federated-Wiki), which uses json as its file format. Eventually the json just gets written to a file. But the error is blowing up during the json encoding, before I try to write to the file. – interstar Mar 13 '12 at 23:38
  • Don Question. OK, point taken. I do "accept" answers when I feel I get solutions. But, like I say, I tend to ask fairly open-ended questions and while I get lots of good partial clues towards what I'm trying to solve, often I don't get THE answer in an easily recognisable package. It smells to me like StackOverflow have screwed up the culture here with their incentive system. I'm starting to prefer the Quora "thank" option which has the politeness without people getting upset. – interstar Mar 13 '12 at 23:46
  • SO is more for immediate problems, for which you need a quick solution. It's faster than mailing lists and more elaborate than IRC. The best of two worlds. To get this "fast" response working, reputation seems like a good approach to me. AND you get more than one view quickly, because there is something like a "competition" between all participants. In the end you're hurting only yourself with a low AR. ;-) – Don Question Mar 13 '12 at 23:53

1 Answer


Try adding the argument ensure_ascii=False. Also, especially when asking unicode-related questions, it's very helpful to add a longer (complete) traceback and to state which Python version you are using.

Citing the Python documentation of version 2.6.7:

"If ensure_ascii is False (default: True), then some chunks written to fp may be unicode instances, subject to normal Python str to unicode coercion rules. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error."

So this proposal may cause new problems, but it fixed a similar problem I had: I fed the resulting unicode string into a StringIO object and wrote that to a file.
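To make that concrete, here is a sketch of the write path the quoted documentation hints at, via codecs.open (the filename and the UTF-8 target are my assumptions):

    import codecs
    import json

    result = json.dumps([u'caf\xe9'], ensure_ascii=False)  # a unicode object

    # codecs.open returns a file object whose write() accepts unicode,
    # so no implicit ascii conversion happens on the way out.
    with codecs.open('out.json', 'w', encoding='utf-8') as fp:
        fp.write(result)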

Because of Python 2.7 and sys.getdefaultencoding() being set to ascii, the implicit conversion through the ''.join(chunks) statement in the json standard library will blow up if chunks is not ASCII-encodable! You must ensure that any contained strings are converted to an ASCII-compatible representation beforehand! You may try UTF-8-encoded strings, but unicode strings won't work if I'm not mistaken.
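You can watch that failure mode in isolation (my illustration): when ''.join() meets a mix of unicode and a non-ASCII byte str, Python 2 silently decodes the byte string with the ascii default and blows up exactly like the traceback above:

    import sys
    print sys.getdefaultencoding()  # 'ascii' on a stock Python 2 install

    chunks = [u'ok', '\xb4']        # unicode mixed with a non-ascii byte str
    ''.join(chunks)                 # implicit '\xb4'.decode('ascii') raises
                                    # the same UnicodeDecodeError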

Don Question
  • Thanks Don Question. I added the stack-trace to the question. Unfortunately it doesn't seem to be the ensure_ascii flag. – interstar Mar 14 '12 at 00:12
  • sys.getdefaultencoding gives ascii back? (would be reasonable for Python 2.7) If so, then `''.join(chunks)` will automagically try to convert `chunks` to ascii, if __str__ does so. I guess `conv` and `importFiles` are from your own code? – Don Question Mar 14 '12 at 01:13
  • Yep. getdefaultencoding() comes back with ascii. Those names are from my code, yes. So how is the json library meant to cope if it's going to be forced to coerce into ASCII with no error handler by system defaults? Is this an incompatibility between the standard library and the version of Python? – interstar Mar 14 '12 at 01:44
  • No! It's meant to be this way, because strings are internally not handled as unicode in Python 2.x! So it's a pain in the ass ... you can never gloss over string conversion and hope it just works. You always have to struggle with the conversions. You may take a peek into the source of `/usr/lib/python2.7/json/encoder.py` to get a glimpse of what is going on. You will see that the developers did try to cope as best they can, but it's not very intuitive the first time you do this. They assume you're not giving them unicodes, and you try to play nice by providing them = DISASTER! ;-) – Don Question Mar 14 '12 at 01:54
  • It's hard to tell if someone understands the concept of unicode and its encodings. But if you just assume they don't, they may take that as an insult, regardless of whether they do or don't! ;-) I tried to illustrate this issue graphically: http://stackoverflow.com/a/8592536/1107807 ;-) And the problem remains that in Python 2.x string representations are just a plain pain in the ass! ;-) – Don Question Mar 14 '12 at 02:32