
Yet another person unable to find the correct magic incantation to get Python to print UTF-8 characters.

I have a JSON file. The JSON file contains string values. One of those string values contains the character "à". I have a Python program that reads in the JSON file and prints some of the strings in it. Sometimes when the program tries to print the string containing "à" I get the error

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 12: ordinal not in range(128)

This is hard to reproduce. Sometimes a slightly different program is able to print the string "à". A smaller JSON file containing only this string does not exhibit the problem. If I start sprinkling encode('utf-8') and decode('utf-8') around the code it changes what blows up in unpredictable ways. I haven't been able to create a minimal code fragment and input that exhibits this problem.

I load the JSON file like so.

with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)

I'll pull out the offending string like so.

s = j['key']

Later I do a print that has s as part of it and see the error.

I'm pretty sure the original file is in UTF-8 because in the interactive command line

codecs.open(filename, 'r', 'utf-8').read()

returns a string but

codecs.open(filename, 'r', 'ascii').read()

gives an error about the ascii codec not being able to decode such-and-such a byte. The file size in bytes is identical to the number of characters returned by wc -c, and I don't see anything else that looks like a non-ASCII character, so I suspect the problem lies entirely with this one high-ASCII "à".

I am not making any explicit calls to str() in my code.

I've been through the Python Unicode HOWTO multiple times. I understand that I'm supposed to "sandwich" unicode handling. I think I'm doing this, but obviously there's something I still misunderstand.

Mostly I'm confused because it seems like if I specify 'utf-8' in the codecs.open call, everything should be happening in UTF-8. I don't understand how the ASCII codec still sneaks in.

What am I doing wrong? How do I go about debugging this?


Edit: Used io module in place of codecs. Same result.


Edit: I don't have a minimal example, but at least I have a minimal repro scenario.

The problem occurs when I print an object that is derived from the strings in the JSON. So the following gives an error.

print(myobj)

(Note that I am using from __future__ import print_function though that does not appear to make a difference.)

Putting an encode('utf-8') on the end of my object's __str__ function return value does not fix the bug. However, changing the print line to this does.

print("%s" % myobj)

This looks wrong to me. I'd expect these two print calls to be equivalent.


I can make this work by doing the sys.setdefaultencoding hack:

import sys
reload(sys)
sys.setdefaultencoding("UTF-8")

But this is apparently a bad idea that can make Python malfunction in other ways.

What is the correct way to do this? I tried

env PYTHONIOENCODING=UTF-8 ./myscript.py

but that didn't work. (Unsurprisingly, since the issue is the default encoding, not the io encoding.)

W.P. McNeill
  • what is `print(repr(myobj))`? – jfs Feb 08 '15 at 22:55
  • print(repr(myobj)) does not throw an error. It prints the repr string as expected. (e.g. escaped "\n" instead of newlines.) – W.P. McNeill Feb 10 '15 at 17:35
  • I've asked what is it (the object that causes the issue) and not whether `repr(myobj)` causes an error (it shouldn't). Do you use a custom object (custom `__unicode__`, `__str__`) to parse json? Why? – jfs Feb 10 '15 at 20:37

1 Answer


In Python 2, when you write directly to a file, or redirect stdout to a file or pipe, the default encoding is ASCII, and you have to encode Unicode strings before writing them. With opened file handles you can set an encoding to have this happen automatically, but with print you must call the encode() method yourself.

print s.encode('utf-8')
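
To illustrate why the explicit encode avoids the error, here is a minimal sketch (not taken from the question's code) that encodes the same character with both codecs. The variable names are made up for illustration. The ASCII attempt fails with the same reason text that appears in the traceback, while UTF-8 succeeds and yields two bytes.

```python
# The character from the question (U+00E0) as a Unicode string.
s = u"\xe0"

# Explicit UTF-8 encoding succeeds and produces a two-byte sequence.
utf8_bytes = s.encode("utf-8")

# Encoding with the ASCII codec fails because U+00E0 is outside 0-127.
try:
    s.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(len(utf8_bytes))
print(ascii_error.reason)
```

In Python 2 the failing ASCII encode can also happen implicitly, whenever a Unicode string is coerced to a byte string, which is why the error can show up without any visible call to encode().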

It is recommended to use the newer io module in place of codecs because it has an improved implementation and is forward compatible with Py3.x open().
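
A sketch of what loading the file through io might look like. The temporary file and its contents are hypothetical stand-ins for the JSON file in the question, and only the key name 'key' comes from the original code.

```python
import io
import json
import os
import tempfile

# Hypothetical sample data standing in for the JSON file in the question.
fd, filename = tempfile.mkstemp(suffix=".json")
os.close(fd)
with io.open(filename, "w", encoding="utf-8") as f:
    f.write(u'{"key": "voil\u00e0"}')

# io.open decodes the bytes while reading, so json.load receives Unicode text.
with io.open(filename, "r", encoding="utf-8") as f:
    j = json.load(f)

s = j["key"]  # a Unicode string, not UTF-8 bytes
os.remove(filename)
```

Note that the encoding passed to io.open only governs reading the file. It has no effect on how s is encoded later when it is printed.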

Kevin Thibedeau
  • you should *not* encode Unicode before printing it. Don't hardcode the character encoding of your environment inside your script. Set proper locale settings (LANG, LC_CTYPE, LC_ALL) and/or `PYTHONIOENCODING` envvar instead. – jfs Feb 08 '15 at 23:21
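
As a debugging aid for the environment settings mentioned in the comment above, the following sketch collects the encodings Python has picked up. The helper name encoding_report is made up for illustration; the calls it wraps are standard library functions.

```python
import locale
import os
import sys


def encoding_report():
    """Collect encodings that influence printing (hypothetical helper)."""
    return {
        # Encoding print uses for Unicode text written to stdout.
        # In Python 2 this is None when stdout is redirected to a file or pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for implicit str/unicode coercion ('ascii' in Python 2).
        "default": sys.getdefaultencoding(),
        # Encoding derived from the locale settings (LANG, LC_CTYPE, LC_ALL).
        "locale": locale.getpreferredencoding(False),
        # Value of the override variable, or None if it is not set.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

Running this once in a terminal and once with output redirected to a file shows whether the two situations use different stdout encodings.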