When profiling our code I was surprised to find millions of calls to
C:\Python26\lib\encodings\utf_8.py:15(decode)

I started debugging and found that across our code base there are many small bugs, usually comparing a string to a unicode object or adding a string and a unicode object. Python graciously decodes the strings and performs the subsequent operations in unicode.

How kind. But expensive!

I am fluent in unicode, having read Joel Spolsky and Dive Into Python...

I try to keep our code internals in unicode only.

My question: can I turn off this pythonic nice-guy behavior? At least until I find all these bugs and fix them (usually by adding a 'u' prefix)?

Some of them are extremely hard to find (a variable that is sometimes a string...).

Python 2.6.5 (and I can't switch to 3.x).

Tal Weiss

1 Answer

The following should work:

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('undefined')
>>> u"abc" + u"xyz"
u'abcxyz'
>>> u"abc" + "xyz"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

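The reload(sys) in the snippet above is needed only because site.py removes sys.setdefaultencoding from the sys module at interpreter startup. Normally the sys.setdefaultencoding call is supposed to go in a sitecustomize.py file in your Python site-packages directory, where the function is still available (it's advisable to do that).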
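A sitecustomize.py along these lines might look roughly like the sketch below. The helper names enable_strict_unicode and to_unicode are hypothetical, chosen here for illustration, and the UTF-8 default in to_unicode is an assumption about your data. The guard makes the file a harmless no-op wherever sys.setdefaultencoding is not available.

```python
# sitecustomize.py: a sketch only; both helper names are hypothetical.
import sys


def enable_strict_unicode():
    """Switch the default encoding to 'undefined' if the interpreter allows it.

    Returns True when the switch happened, and False when
    sys.setdefaultencoding is not available (site.py has already
    removed it, or the interpreter is Python 3).
    """
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is None:
        return False
    setter("undefined")
    return True


def to_unicode(value, encoding="utf-8"):
    """Decode byte strings explicitly; pass unicode values through unchanged.

    The UTF-8 default is an assumption; use whatever your input really is.
    """
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2.6+
        return value.decode(encoding)
    return value


# Runs when Python imports sitecustomize at startup.
enable_strict_unicode()
```

Once a UnicodeError points at a mixed str/unicode operation, the fix is either a u prefix on the literal or an explicit decode at the point where the byte string enters your code, for example to_unicode(raw_line) on data read from a file.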

ChristopheD
  • Oh wow. I love it. Can you explain a little more how the `reload()` does its magic? How and why does it nullify the `sitecustomize.py` setting? – jcdyer May 17 '10 at 19:15
  • On my Apple Python 2.6 build (but I've seen this elsewhere...) `site.py` (in your std python lib dir; executed once automagically at Python startup) contains (near the end): `if hasattr(sys, "setdefaultencoding"): del sys.setdefaultencoding`. This makes this attribute unavailable on `sys` unless you explicitly choose to `reload(sys)` (or comment out the deletion). It used to be available directly in earlier Pythons iirc. – ChristopheD May 17 '10 at 19:27
  • Very cool - thank you! Pydev and Pylint hate you, but it works! ...and I found a truckload of "bugs" in a few minutes, some of them in the Python source code! (They are not exactly bugs because the code works, it just works a little better after I fix it). CSV files: split(u'\t') needed the little 'u'. Dictionary keys are not exactly unicode in 2.6... - who would have thunk?!?! Thank you! – Tal Weiss May 18 '10 at 20:57