Yet another person unable to find the correct magic incantation to get Python to print UTF-8 characters.
I have a JSON file. The JSON file contains string values. One of those string values contains the character "à". I have a Python program that reads in the JSON file and prints some of the strings in it. Sometimes when the program tries to print the string containing "à" I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 12: ordinal not in range(128)
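For what it's worth, I gather this is the generic Python 2 situation where a unicode object gets implicitly encoded with the ascii codec. A contrived illustration, not my actual code, that I believe raises the same exception when output is redirected or the terminal is not UTF-8:

s = u'voil\xe0'   # u'\xe0' is the "à" character
# With output piped, sys.stdout.encoding is None and print falls back to
# the default 'ascii' codec, which I assume is where the error comes from.
print(s)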
My actual failure, though, is hard to reproduce. Sometimes a slightly different program is able to print the string "à", and a smaller JSON file containing only this string does not exhibit the problem. If I start sprinkling encode('utf-8') and decode('utf-8') around the code, it just changes what blows up, in unpredictable ways. I haven't been able to create a minimal code fragment and input that exhibit this problem.
I load the JSON file like so.
import codecs
import json

with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)
I'll pull out the offending string like so.
s = j['key']
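I believe json.load hands back unicode objects here; a quick sanity check along these lines (in the interactive interpreter) should show whether the decode side is fine:

print(type(s))   # <type 'unicode'> on Python 2
print(repr(s))   # shows u'...\xe0...'; repr itself never raises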
Later I do a print that has s as part of it and see the error.
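My understanding, which may be exactly where I'm wrong, is that print picks the codec from sys.stdout.encoding when it writes a unicode object, and falls back to ASCII when that attribute is None (for example, when output is piped). So one thing I can check in the failing run is:

import sys
print(sys.stdout.encoding)   # e.g. 'UTF-8' at my terminal, None when piped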
I'm pretty sure the original file is in UTF-8, because in the interactive interpreter codecs.open(filename, 'r', 'utf-8').read() returns a string, but codecs.open(filename, 'r', 'ascii').read() raises an error about the ascii codec not being able to decode such-and-such a byte. The file size in bytes is identical to the number of characters returned by wc -c, and I don't see anything else that looks like a non-ASCII character, so I suspect the problem lies entirely with this one non-ASCII "à".
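A further byte-level check I could do (a sketch; filename is the same path as above):

with open(filename, 'rb') as f:   # raw bytes, no decoding
    raw = f.read()
print(raw.count('\xc3\xa0'))                        # 0xC3 0xA0 is UTF-8 for "à" (U+00E0)
print([hex(ord(b)) for b in raw if ord(b) > 127])   # any other non-ASCII bytes?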
I am not making any explicit calls to str() in my code.
I've been through the Python Unicode HOWTO multiple times. I understand that I'm supposed to "sandwich" Unicode handling: decode at the boundaries coming in, work with unicode objects internally, and encode at the boundaries going out. I think I'm doing this, but obviously there's something I still misunderstand.
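My mental model of the sandwich, which may be where I've gone wrong, is roughly this (a sketch, with filename standing in for the real path):

import codecs
import json
import sys

# decode at the input boundary (codecs.open does this for me)
with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)   # everything inside should now be unicode

s = j['key']           # still a unicode object

# encode explicitly at the output boundary instead of relying on print
sys.stdout.write(s.encode('utf-8') + '\n')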
Mostly I'm confused because it seems like, if I specify 'utf-8' in the codecs.open call, everything should be happening in UTF-8. I don't understand how the ASCII codec still sneaks in.
What am I doing wrong? How do I go about debugging this?
Edit: Used the io module in place of codecs. Same result.
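That is, the load now looks something like this:

import io
import json

with io.open(filename, 'r', encoding='utf-8') as f:
    j = json.load(f)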
Edit: I don't have a minimal example, but at least I have a minimal repro scenario.
The problem occurs when I print an object derived from the strings in the JSON; the following gives the error.
print(myobj)
(Note that I am using from __future__ import print_function, though that does not appear to make a difference.)
Appending .encode('utf-8') to the return value of my object's __str__ method does not fix the bug. However, changing the print line to this does:
print("%s" % myobj)
This looks wrong to me. I'd expect these two print calls to be equivalent.
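If it helps, I suspect the essence of my repro is something like this contrived class (not my real code), though I may be misreading what print does with it:

from __future__ import print_function

class MyObj(object):
    def __str__(self):
        return u'voil\xe0'   # a unicode object containing "à"

print("%s" % MyObj())   # I expect this to print the string at a UTF-8 terminal
print(MyObj())          # and this to raise the UnicodeEncodeError above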
I can make this work by doing the sys.setdefaultencoding hack:
import sys
reload(sys)                      # site.py deletes setdefaultencoding; reload restores it
sys.setdefaultencoding("UTF-8")
But this is apparently a bad idea that can make Python malfunction in other ways.
What is the correct way to do this? I tried env PYTHONIOENCODING=UTF-8 ./myscript.py, but that didn't work. (Unsurprisingly, since the issue is the default encoding, not the I/O encoding.)
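As a sanity check on that theory, these seem to be the two separate settings in play, and only the second is affected by PYTHONIOENCODING:

import sys
print(sys.getdefaultencoding())   # 'ascii' unless the reload() hack above is used
print(sys.stdout.encoding)        # what PYTHONIOENCODING and the terminal control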