
We've already gotten our code base running under Python 2.6. In order to prepare for Python 3.0, we've started adding:

from __future__ import unicode_literals

into our .py files (as we modify them). I'm wondering if anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).
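
For reference, the behaviour we're relying on (just a quick illustration, nothing project-specific) is that plain string literals become unicode after the import, while b'' literals stay str:

# Python 2.6+
from __future__ import unicode_literals

print type('abc')   # <type 'unicode'> -- plain literals are now unicode
print type(b'abc')  # <type 'str'>     -- byte literals stay str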

hippietrail
Jacob Gabrielson

6 Answers


The main source of problems I've had working with unicode strings is mixing UTF-8 encoded byte strings (type str) with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

In this example, two.name is a UTF-8 encoded byte string (type str, not unicode) since two.py does not import unicode_literals, and name in one.py is a unicode string. When you mix the two, Python tries to decode the byte string (assuming it's ASCII), convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').
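
For instance, a fixed one.py along those lines (just a sketch of the decode approach) would be:

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name.decode('utf-8')  # decode the byte string before mixing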

The same thing can happen if you encode a string and try to mix them later. For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding from __future__ import unicode_literals it does NOT:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

It fails because 'DEBUG: %s' is a unicode string, so Python tries to decode html (now a UTF-8 encoded byte string) back to unicode using the ASCII codec. A couple of ways to fix the print are either print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').
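
For example, the decode variant of the script above (a sketch) would be:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
# decode back to unicode before mixing with the unicode format string
print 'DEBUG: %s' % html.decode('utf-8')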

I hope this helps you understand the potential gotchas when using unicode strings.

gregoltsov
Koba
  • I would suggest going with the `decode()` solutions instead of the `str()` or `encode()` solutions: the more often you use Unicode objects, the clearer the code is, since what you want is to manipulate strings of characters, not arrays of bytes with an externally implied encoding. – Eric O. Lebigot Sep 03 '10 at 12:35
  • Please fix your terminology. `when you mix utf-8 encoded strings with unicode ones` UTF-8 and Unicode aren't 2 different encodings; Unicode is a standard and UTF-8 is one of the encodings it defines. – Kos Jun 08 '12 at 18:59
  • @Kos: I think he means mixing "utf-8 encoded string" *objects* with unicode (hence decoded) *objects*. The former is of type `str`, the latter of type `unicode`. Being different objects, problems may arise if you try to sum/concatenate/interpolate them. – MestreLion Sep 28 '12 at 01:33
  • Does this apply to `python>=2.6` or `python==2.6`? – joar Jan 15 '13 at 22:20

Also, in 2.6 (before Python 2.6.5 RC1+), unicode literals don't play nicely with keyword arguments (issue4978):

The following code, for example, works without unicode_literals but fails with TypeError: keywords must be strings if unicode_literals is used.

  >>> from __future__ import unicode_literals
  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a': 1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: foo() keywords must be strings
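
If you're stuck on an affected 2.6 release, one workaround (a sketch; it forces the keyword names back to native str) is:

  >>> foo(**{str('a'): 1})  # str() turns the unicode key back into a byte string
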
tzot
mfazekas

I did find that if you add the unicode_literals directive you should also add something like:

 # -*- coding: utf-8

to the first or second line of your .py file. Otherwise lines such as:

 foo = "barré"

result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
 but no encoding declared; see http://www.python.org/peps/pep-0263.html 
 for details
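
Putting the two together, a typical file header (just a sketch) looks like:

 # -*- coding: utf-8 -*-
 from __future__ import unicode_literals

 foo = "barré"  # parses fine and is a unicode object
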
Jacob Gabrielson
  • @IanMackinnon: Python 3 assumes that files are UTF-8 by default – endolith Jul 08 '12 at 01:28
  • @endolith: But Python 2 doesn't, and it will give the syntax error if you use non-ascii chars *even in comments*! So IMHO `# -*- coding: utf-8` is a virtually mandatory statement regardless of whether you use `unicode_literals` or not – MestreLion Sep 28 '12 at 01:39
  • The `-*-` is not required; if you were going for the emacs-compatible way, I think you'd need `-*- encoding: utf-8 -*-` (note the `-*-` at the end as well). All you need is `coding: utf-8` (or even `=` instead of `: `). – Chris Morgan Dec 26 '12 at 04:53
  • You get this error whether or not you `from __future__ import unicode_literals`. – Flimm Apr 12 '13 at 11:36
  • Emacs compatibility [requires](https://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html) ``# -*- coding: utf-8 -*-`` with "coding" (not "encoding" or "fileencoding" or anything else - Python just looks for "coding" regardless of any prefix). – Alex Dupuy Jul 15 '14 at 15:07

Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which IMHO is a bug), i.e. eval(repr(b'\xa4')) won't be equal to b'\xa4' (as it would be with Python 3).

Ideally, the following code would be an invariant that always holds, for all combinations of unicode_literals and Python {2.7, 3.x} usage:

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+

The second assertion happens to work, since under unicode_literals repr('\xa4') in Python 2.7 yields the text u'\xa4', which eval() turns back into the same unicode string.

hvr
  • I feel like the bigger problem here is that you're using `repr` to regenerate an object. The [`repr` documentation](https://docs.python.org/2/library/functions.html#func-repr) clearly states that this is *not* a requirement. In my opinion, this relegates `repr` to something useful only for debugging. – jpmc26 Aug 10 '14 at 17:42

There are more gotchas.

There are builtins and libraries that expect byte strings (type str) and don't tolerate unicode.

Two examples:

builtin:

myenum = type('Enum', (), enum)  # 'enum' here is some dict of class attributes

(slightly esoteric) doesn't work with unicode_literals: in Python 2, type() expects the class name to be a byte string (str), not unicode.

library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")

doesn't work: the wx pubsub library expects a string message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)

but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).

Dang it, eh?!?
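
If you do go the search-and-replace route mentioned in the comments below, the calls end up looking something like this (a sketch, assuming the same pubsub setup as above):

from wx.lib.pubsub import pub
pub.sendMessage(b"LOG MESSAGE", msg="unicode payloads are fine here")  # byte-string topic name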

GreenAsJade
  • And the type stuff also leaks into metaclasses - so in Django any strings you declare in `class Meta:` should be `b'field_name'` – Hamish Downer Sep 27 '13 at 10:32
  • Yeah ... in my case I realised that it was worth the effort to search and replace all the sendMessage strings with b' versions. If you want to avoid the dreaded "decode" exception, there is nothing like strictly using unicode in your program, converting on input and output as necessary (the "unicode sandwich" referred to in some paper I read on the topic). Overall, unicode_literals has been a big win for me... – GreenAsJade Sep 28 '13 at 09:46

Click will raise unicode exceptions all over the place if you use click.echo in any module that has from __future__ import unicode_literals. It's a nightmare…

Sardathrion - against SE abuse