
I have a large project where, in various places, problematic implicit Unicode conversions (coercions) were used, e.g.:

someDynamicStr = "bar" # could come from various sources

# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

someDynamicStr = "\xff" # uh-oh

# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

(Possibly other forms as well.)

Now I would like to track down those usages, especially those in actively used code.

It would be great if I could easily replace the `unicode` constructor with a wrapper that checks whether the input is of type `str` and the encoding/errors parameters are set to their default values, and then notifies me (prints a traceback or similar).

/edit:

While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exception go away altogether (only the decode one, i.e. str to unicode, not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).

I don't plan on using it, but it might be interesting for those battling invalid Unicode input and looking for a quick fix (but please think about the side effects):

import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively

(An internet search for `codecs.register_error("strict"` revealed that apparently it's used in some real projects.)
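To illustrate what such a handler does, here is a minimal, self-contained sketch (the handler name `skip_bytes` is my own invention for demonstration purposes): the replacement handler drops the offending bytes and resumes decoding after them. Registering it under a fresh name is the safe, opt-in variant; the mailing-list hack instead registers it as `"strict"`, which silences every default-mode decode error globally.

```python
import codecs

# The replacement handler: drop the offending bytes, resume after them.
skip_bytes = lambda exc: (u"", exc.end)

# Registering it under a new, explicit name is the safe, local variant:
codecs.register_error("skip_bytes", skip_bytes)
print(b"foo\xff".decode("ascii", "skip_bytes"))  # -> foo

# The hack from the mailing list instead clobbers "strict" globally,
# silencing *every* default-mode decode error in the process:
codecs.register_error("strict", skip_bytes)
```

The difference is scope: the named handler only applies where you ask for it, while overriding `"strict"` changes the behavior of all codecs everywhere in the process.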

/edit #2:

For explicit conversions I made a snippet with the help of a SO post on monkeypatching:

class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode

This only affects explicit conversions using the `unicode()` constructor directly, so it's not what I need.

/edit #3:

The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).

/edit #4:

It's nice to see many good answers here; too bad I can only give out the bounty once.

In the meantime I came across a somewhat similar question, at least in terms of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Note, though, that throwing an exception would not have been OK in my case. Here I was looking for something that might point me to the various locations of problematic code (e.g. by printing something) but not something that might exit the program or change its behavior (because this way I can prioritize what to fix).

On another note, the people working on the Mypy project (which include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.

/edit #5:

I also came across the following but haven't had the time to test it yet: https://pypi.python.org/pypi/unicode-nazi

phk
  • Would you be happy to just locate them, or what is your end goal? – Padraic Cunningham Sep 25 '16 at 15:38
  • @PadraicCunningham Assuming that it's C code I guess locating, showing what I would have to change there (e.g. how to call Python code again from there if that's possible) and how to recompile everything back into a custom build would help me. But I would hope that a simpler way exists. My end goal is simply to have a way to detect all the problematic implicit `unicode` conversions which might lead to `UnicodeDecodeError`s. – phk Sep 25 '16 at 15:45
  • You _might_ be able to do something with `sys.settrace` and a custom trace function. I played around with it for a few minutes, and could see the errant call to `decode`, but couldn't figure out a way to check the type of the argument. https://pymotw.com/2/sys/tracing.html – Lucas Wiman Sep 26 '16 at 22:39
  • @RecursivelyIronic Sounded promising, I tried to do the same as you but for me no `decode` calls would even show up. It could have to do something with my Python build as the documentation mentions. – phk Sep 27 '16 at 16:22
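Following up on the tracing idea from the comments: calls into C functions such as `decode` are invisible to a `sys.settrace` trace function, but `sys.setprofile` does report them as `c_call` events, with the C function object as the event argument. Below is a rough sketch (Python 3 syntax). Note that this only catches *explicit* `decode()` calls made from Python code; whether Python 2's implicit coercions would surface this way is unclear, since they happen inside the interpreter rather than through a Python-level call, which may explain why no `decode` calls showed up in the experiment above.

```python
import sys

decode_calls = []

def profiler(frame, event, arg):
    # "c_call" fires when Python code invokes a built-in/C function;
    # arg is the C function object itself, so we can check its name
    if event == "c_call" and getattr(arg, "__name__", None) == "decode":
        decode_calls.append((frame.f_code.co_filename, frame.f_lineno))

sys.setprofile(profiler)
text = b"hello".decode("ascii")  # explicit decode: visible as a c_call
sys.setprofile(None)

print(decode_calls)
```

Each recorded entry gives the file and line of the caller, which is the kind of location information the question asks for, without changing program behavior.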

4 Answers


You can register a custom encoding which prints a message whenever it's used:

Code in ourencoding.py:

import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:

def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:

originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)  # needed because site.py deletes sys.setdefaultencoding at startup
sys.setdefaultencoding('our_encoding')

If you import this file at the start of your script, then you'll see warnings for implicit conversions:

#!python2
# coding: utf-8

import ourencoding

print("test 1")
a = "hello " + u"world"

print("test 2")
a = "hello ☺ " + u"world"

print("test 3")
b = u" ".join(["hello", u"☺"])

print("test 4")
c = unicode("hello ☺")

output:

test 1
test 2
Default encoding used
 File "test.py", line 10, in <module>
   a = "hello ☺ " + u"world"
test 3
Default encoding used
 File "test.py", line 13, in <module>
   b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
 File "test.py", line 16, in <module>
   c = unicode("hello ☺")

It's not perfect: as test 1 shows, if the converted string contains only ASCII characters, you sometimes won't see a warning.

roeland
  • Dang... I did not refresh my browser, so I only saw your answer after I posted mine. But on the other hand, mine works for all of your test cases. – Dakkaron Sep 28 '16 at 12:53
  • @Dakkaron That appears to be system-dependent, on my system (windows 10) test 1 doesn't produce any logging either with your answer. – roeland Sep 29 '16 at 05:04
  • I tracked down what happened. The difference is not the platform, but whether that's in a script that you run with python or if it's run in the interactive shell. Python concats the string and unicode in test 1 at load time, not at run-time. So that happens, before you change the encoding. The same thing happens if you define a function with that content. Then it gets concatted when the function is defined, not when it's run. – Dakkaron Sep 29 '16 at 11:41
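The load-time behavior described in the last comment can be checked by inspecting the compiled code object. The sketch below uses Python 3 (where both operands are `str`) purely to illustrate CPython's compile-time constant folding; in Python 2, the analogous folding of a `str`/`unicode` literal pair is what triggers the coercion before the module body ever runs, i.e. before `ourencoding` gets a chance to install itself.

```python
# Compile a module body containing the concatenation of two literals.
code = compile('a = "hello " + u"world"', "<demo>", "exec")

# CPython folds the two constants at compile time, so the combined
# string already sits in the code object's constants table.
print(code.co_consts)
```

Because the folded result is already a constant, no conversion happens at run time for that statement, which is why test 1 produces no warning.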

What you can do is the following:

First create a custom encoding. I will call it "lascii" for "logging ASCII":

import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input, errors)


def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input, errors)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)
    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        # incremental encoders return just the data, not a (data, length) tuple
        return codecs.ascii_encode(input, self.errors)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input, self.errors)[0]

class StreamWriter(Codec,codecs.StreamWriter):
    pass

class StreamReader(Codec,codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )

What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes something.

Now you need to make it available to the codecs module so that it can be found under the name "lascii". For this you create a search function that returns the lascii codec when it's asked for the string "lascii", and register it with the codecs module:

def searchFunc(name):
    if name=="lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)

The last thing left to do is to tell the sys module to use 'lascii' as the default encoding:

import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')

Warning: This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.

Dakkaron

Just add:

from __future__ import unicode_literals

at the beginning of your source code files. It has to be the first import, and it has to be in every affected source file; then the headache of using Unicode in Python 2.7 goes away. If you didn't do anything super weird with strings, it should get rid of the problem out of the box.
Check out the following copy-and-paste from my console – I tried the sample from your question:

user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> someDynamicStr = "bar" # could come from various sources

>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"foo{}".format(someDynamicStr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>>

And now with __future__ magic:

user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
u'foo\xff'
>>> u"foo{}".format(someDynamicStr)
u'foo\xff'
>>> 
ElmoVanKielmo
  • Interesting, so what exactly does that do? From what I gather it changes Python 2 to be more like Python 3, so it's as if `u` were the default prefix, thus turning all `str`s (in Python 3 the equivalent would be `bytes`) into `unicode` (which is `str` in Python 3), i.e. the output of `type("")` changes from `<type 'str'>` to `<type 'unicode'>`. To then get `str`s you have to use the `b""` prefix. Interestingly it does not change the constructors or any of the `decode()`/`encode()` functions. – phk Oct 01 '16 at 18:33
  • While I think this magic might be great for new projects that set out to run on both Python 2 and Python 3, I am afraid possible side effects might come back to bite me in this particular project. There is already too much explicit decoding/encoding, writing/reading of files and converting to different formats like JSON and BER happening. See also the drawbacks section at http://python-future.org/unicode_literals.html and the following Stack Overflow thread: http://stackoverflow.com/q/809796/2261442 – phk Oct 01 '16 at 18:35
  • I didn't write it in the answer but this actually fixed problems with legacy code for me. I don't know about BER but it never caused problems with JSON for me. Maybe give it a try? – ElmoVanKielmo Oct 02 '16 at 00:37

I see you have a lot of edits relating to solutions you may have encountered. I'm just going to address your original post which I believe to be: "I want to create a wrapper around the unicode constructor that checks input".

The `unicode` function is a Python built-in. You can decorate it to add checks to every call:

import traceback

def add_checks(fxn):
    def resulting_fxn(*args, **kwargs):
        # this is where we check whether the input is of type str
        if type(args[0]) is str:
            # print any information, e.g. a traceback of the call site
            print("Problematic unicode() usage detected!")
            traceback.print_stack()
        # this is where the encoding/errors parameters are set to default values
        encoding = 'utf-8'

        # set the default error behavior
        errors = 'ignore'

        return fxn(args[0], encoding, errors)
    return resulting_fxn

Using it would look like this:

unicode = add_checks(unicode)

We overwrite the existing function name so that you don't have to change all the calls in the large project. You want to do this very early on in the runtime so that subsequent calls have the new behavior.
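For reference, the same decorator idea can be sketched in a self-contained, Python 3-compatible form by wrapping `str` instead (since the `unicode` builtin no longer exists there); the trigger condition, the `checked_str` name, and the message are illustrative assumptions, not part of the original answer:

```python
import traceback

def add_checks(fxn):
    def resulting_fxn(*args, **kwargs):
        # flag calls that would convert bytes without an explicit encoding
        if args and isinstance(args[0], bytes) and len(args) == 1 \
                and "encoding" not in kwargs:
            print("Problematic conversion detected!")
            traceback.print_stack()
        return fxn(*args, **kwargs)
    return resulting_fxn

checked_str = add_checks(str)

print(checked_str(b"abc", "ascii"))  # explicit encoding: no warning -> abc
print(checked_str("plain"))          # plain text: no warning
```

In Python 2 you would wrap `unicode` exactly as the answer shows; the wrapper prints a stack trace pointing at the problematic call site while leaving the return value unchanged.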

nbui
  • I'm looking for a solution for **implicit** conversions; one for explicit conversions I already presented in my original post. – phk Sep 25 '16 at 15:37
  • For the examples you gave me, Python throws errors. They don't even run, let alone are correct. For them to be in the codebase is the root of the problem. If you want to edit, what is essentially, how the `u` prefix is parsed, that's going to be more work than writing a good `sed` or `regex` to change all implicit conversions to explicit and then use a min. of 2 solutions you already have. So I encourage using explicit conversions and extending the `unicode` constructor. – nbui Sep 25 '16 at 16:17
  • `someStr` is dynamic, as long as there are only ASCII characters in there the code runs OK. `someStr` might come from a file, user input, output of an external process, … – phk Sep 25 '16 at 16:19
  • And I doubt you can create a RegEx for tracking down all but the simplest implicit conversions. – phk Sep 25 '16 at 16:28