A very common source of encoding errors is that Python 2 silently coerces byte strings (str) to unicode when you concatenate them with unicode objects. This can cause mixed-encoding problems that are very hard to debug.

For example:

import urllib
import webbrowser
name = raw_input("What's your name?\nName: ")
greeting = "Hello, %s" % name
if name == "John":
    greeting += u' (Feliz cumplea\xf1os!)'
webbrowser.open('http://lmgtf\x79.com?q=' + urllib.quote_plus(greeting))

will fail with a cryptic error if you enter "John":

/usr/lib/python2.7/urllib.py:1268: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "feliz.py", line 7, in <module>
    webbrowser.open('http://lmgtf\x79.com?q=' + urllib.quote_plus(greeting))
  File "/usr/lib/python2.7/urllib.py", line 1273, in quote_plus
    s = quote(s, safe + ' ')
  File "/usr/lib/python2.7/urllib.py", line 1268, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'

It's particularly hard to track down when the error surfaces far from where the coercion actually happened.

How can you configure Python to give a warning or exception immediately when strings are coerced to unicode?

Mu Mind
  • Hrm... I had to mangle the "lmgtf y.com" URL because SO wouldn't let me post it otherwise... – Mu Mind Sep 24 '12 at 00:34
  • AFAIK this isn't configurable, unfortunately. [And I think that particular site was blocked because it was really only used as a snarky rickroll. There are a few discussions on meta.] – DSM Sep 24 '12 at 00:56
  • Why are you not using `unicode` literals in the first place, and encoding them when it matters? – Ignacio Vazquez-Abrams Sep 24 '12 at 00:58
  • I think there's some confusion here. From a recent answer, I think @MuMind knows how unicode works, and is asking whether there is a way to get Python 3-style automatic-coercion refusal in Python 2. I suspect this was motivated by a [recent question](http://stackoverflow.com/questions/12556839/is-there-an-easy-way-to-make-unicode-work-in-python) where the asker seems to have gotten himself into trouble in ways it'd be harder to do in 3. – DSM Sep 24 '12 at 01:06
  • Nice question. I think this would have been worth a Python 2.x feature, if any were still being added to that branch: a command-line switch to raise errors on mixing both types. – jsbueno Sep 24 '12 at 01:53
  • Especially since it's so important for transitioning to Python 3, and that's kinda the point of the Python 2.7 branch. At the very least, the unicode-nazi tool should get more publicity. – Mu Mind Sep 24 '12 at 02:01

2 Answers

I did a little more research after asking this question and hit on the perfect answer. Armin Ronacher created a wonderful little tool called unicode-nazi. Just install it and run your program like this:

python -Werror -municodenazi myprog.py

and you get a traceback right where the coercion happened:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "SITE-PACKAGES/unicodenazi.py", line 128, in <module>
    main()
  File "SITE-PACKAGES/unicodenazi.py", line 119, in main
    execfile(sys.argv[0], main_mod.__dict__)
  File "myprog.py", line 4, in <module>
    print foo()
  File "myprog.py", line 2, in foo
    return 'bar' + u'baz'
  File "SITE-PACKAGES/unicodenazi.py", line 34, in warning_decode
    stacklevel=2)
UnicodeWarning: Implicit conversion of str to unicode

If you're dealing with Python libraries that trigger implicit coercions themselves and you can't catch the exceptions or otherwise work around them, you can leave out the -Werror:

python -municodenazi myprog.py

and at least see a warning printed out on stderr when it happens:

/SITE-PACKAGES/unicodenazi.py:119: UnicodeWarning: Implicit conversion of str to unicode
  execfile(sys.argv[0], main_mod.__dict__)
barbaz
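
If turning every warning into an error is too noisy, the same idea can be scoped to UnicodeWarning alone. Here is a minimal sketch using only the standard warnings module. The helper name strict_unicode_warnings is hypothetical, and the warnings.warn call merely stands in for the warning unicode-nazi emits; it is not the tool's real code path.

```python
import contextlib
import warnings


@contextlib.contextmanager
def strict_unicode_warnings():
    """Turn UnicodeWarning into an exception inside the with-block only."""
    with warnings.catch_warnings():
        # "error" makes matching warnings raise instead of printing.
        warnings.simplefilter("error", UnicodeWarning)
        yield


def simulate_implicit_coercion():
    # Stand-in for the warning issued on an implicit str -> unicode
    # conversion; the message text is copied from the output above.
    warnings.warn("Implicit conversion of str to unicode", UnicodeWarning)


raised = False
try:
    with strict_unicode_warnings():
        simulate_implicit_coercion()
except UnicodeWarning:
    raised = True

print(raised)
```

On the command line, something like python -W error::UnicodeWarning -municodenazi myprog.py should have a similar effect, since -W accepts an action::category filter, leaving other warning categories as plain warnings.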
Mu Mind

That error isn't cryptic at all. I can gather from it that urllib.quote() (which is called by quote_plus()) doesn't handle unicode very well. After some quick googling, I found this previous SO question asking for unicode-safe alternatives. Unfortunately, none seem to exist.
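
One workaround worth sketching, assuming the target site expects UTF-8: encode the unicode string to bytes yourself before quoting, so that quote_plus() never sees a non-ASCII unicode character. The greeting value is just the string from the question with the name filled in, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

greeting = u'Hello, John (Feliz cumplea\xf1os!)'

# Encode explicitly instead of relying on implicit coercion.
# UTF-8 is an assumption; use whatever encoding the server expects.
quoted = quote_plus(greeting.encode('utf-8'))
print(quoted)
```

This only sidesteps the KeyError at the quoting step; it does not tell you where a str was silently promoted to unicode earlier in the program.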

acattle