Different unicode handling python2.7.9 vs 2.7.15

Question

I have a large project which runs fine with 2.7.9 on many devices.

But now the devices use Python 2.7.15, and in some cases it crashes when someone uses umlauts or an eszett, like äöüß.
In that case, a line like this raises an exception:

logger.info("device name {}".format(device_name))

I built a minimal test.py to reproduce the problem.

# -*- coding: utf-8 -*-

import locale
import os
import sys

print("#1 sys.stdout.encoding={}".format(sys.stdout.encoding))
print("#2 {}".format(locale.getdefaultlocale()))

u = u'aé ä ö ü ß'
print("#repr: " + repr(u.encode('utf-8')))
print("#3 type(u)={}".format(type(u)))

print(u.encode('utf-8', errors='ignore'))

print("#5 u={}".format(u))

With python 2.7.9 it's fine

#1 sys.stdout.encoding=ANSI_X3.4-1968
#2 (None, None)
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
#5 u=aé ä ö ü ß

This fails only with 2.7.15, output:

#1 sys.stdout.encoding=ANSI_X3.4-1968
#2 (None, None)
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
Traceback (most recent call last):
  File "utf8.py", line 16, in <module>
    print("#5 u={}".format(u))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

Even when I used:

export PYTHONIOENCODING="UTF-8"
export LC_ALL=en_GB.utf8
export LANG=en_GB.utf8

This alters the output, but doesn't help

#1 sys.stdout.encoding=UTF-8
#2 ('en_GB', 'UTF-8')
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
Traceback (most recent call last):
  File "utf8.py", line 16, in <module>
    print("#5 u={}".format(u))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

I can fix this error with:

reload(sys)
sys.setdefaultencoding('utf8')

But this solution seems to be very discouraged and I fear the side effects.

How can I fix it in a sane way?
Currently, updating to Python 3 isn't an option.

@tripleee The file is utf-8 encoded. I edited the text to `umlaute/eszett - äöüß` — jeb, Nov 20 '19 at 17:00
Could you add `repr(u.encode('utf-8'))` to the prints to make sure we can see that this is really the case? — tripleee, Nov 20 '19 at 17:10
The first traceback is incomplete, does it have the same error message? — tripleee, Nov 21 '19 at 12:42
I would speculate that this works as designed, and that the older version was just sloppier in what it allowed. Is the output somehow redirected to a file or etc where the `sys.default.encoding` doesn't actually matter, or is it actually connected to standard output? (If so, it's a bug, and not working as designed.) The proper workaround is probably to upgrade to Python 3, I'm afraid. — tripleee, Nov 21 '19 at 14:49

score 1 · Answer 1 · edited Apr 23 '20 at 16:14

One of the biggest changes in Python3 is the use of unicode strings by default.

If it is in your power to change the files where the problem occurs, you can improve the text handling of your program by backporting the unicode-by-default behavior to your Python 2 code: add from __future__ import unicode_literals. (I'd also suggest switching to the much nicer print function with from __future__ import print_function.)

In doing that, you will have to watch the places where your code outputs text back to the "outside world": all print, log, database and file write calls. These may require that you send byte strings. All you have to do is place a manual encoding at these points:

   logger.info("device name {}".format(device_name).encode("utf-8"))

(the print function, however, can handle unicode-strings and will automatically use the guessed terminal encoding to do its output).
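As a minimal sketch of encoding at the output point: the message is built entirely as unicode and encoded to UTF-8 only when it is handed to a byte-oriented sink. The io.BytesIO object is only an assumed stand-in for whatever really receives the bytes (a log handler, file or socket), and the explicit u prefixes are there so the snippet behaves the same without unicode_literals.

```python
# -*- coding: utf-8 -*-
import io

# Text stays unicode inside the program (u prefix, or unicode_literals).
device_name = u'aé ä ö ü ß'

# Unicode template + unicode argument: no implicit ASCII conversion happens.
message = u"device name {}".format(device_name)

# Hypothetical byte-oriented sink standing in for a log file or socket.
sink = io.BytesIO()

# Encode exactly once, at the point where the text leaves the program.
sink.write(message.encode("utf-8"))

written = sink.getvalue()
```

Here written holds the UTF-8 bytes of the whole message, while message itself is still a unicode object that can be reused elsewhere.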

TL;DR: Always have all the text in your program as unicode objects. All of it, even string literals - and just decode from bytes, and encode back to bytes at the interfacing of your system with external components (any I/O).

This may be termed the "unicode sandwich", and it can eliminate 97 in 100 encoding headaches. You may have to spend some time finding out which encoding you need, but you will know exactly where to place the decode (bytes to text), at any function getting data into your program, and the encode (text to bytes), at any function getting data out of your program.
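The sandwich can be sketched roughly as follows. The three function names are hypothetical and only for illustration, and UTF-8 is an assumption about what the outside world sends and expects; the point is that bytes appear only at the two boundaries.

```python
# -*- coding: utf-8 -*-

def read_device_name(raw_bytes, encoding="utf-8"):
    """Input boundary (hypothetical helper): decode bytes to text once."""
    return raw_bytes.decode(encoding)


def build_log_line(device_name):
    """Inside the program: unicode in, unicode out."""
    return u"device name {}".format(device_name.strip())


def write_log_line(line, encoding="utf-8"):
    """Output boundary (hypothetical helper): encode text to bytes once."""
    return line.encode(encoding)


# Bytes as they might arrive from a file, socket or device (assumed UTF-8).
raw = b'  a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f\n'

name = read_device_name(raw)   # bytes -> text
line = build_log_line(name)    # text only
out = write_log_line(line)     # text -> bytes
```

If a source turns out to use a different encoding, only the call at that boundary needs to change; the code in the middle never deals with bytes.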
