
All my Python source code is encoded in UTF-8 and has this coding declared at the top of each file.

But sometimes the u prefix before a unicode string literal is missing.

Example: Umlauts = "üöä"

The line above creates a bytestring containing non-ASCII characters, and this causes trouble (UnicodeDecodeError).

I tried pylint and `python -3`, but I could not get a warning.

I am looking for an automated way to find non-ASCII characters in bytestrings.

My source code needs to support Python 2.6 and Python 2.7.

I get this well-known error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
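
For context, here is a minimal reproduction of how such a bytestring triggers that error (the variable name and the print line are made up for illustration; the file is assumed to be UTF-8 encoded with a coding declaration):

# -*- coding: utf-8 -*-
Umlauts = "üöä"            # bytestring: the u prefix is missing
print u"Hello " + Umlauts  # mixing unicode with a non-ASCII bytestring forces
                           # an implicit ascii decode -> UnicodeDecodeError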

BTW: This question is only about Python source code, not about strings read from files or sockets.

Solution

  • for projects which need to support Python 2.6+ I will use `from __future__ import unicode_literals` (a short sketch follows this list)
  • for projects which need to support Python 2.5 I will use the solution from thg435 (module `ast`)
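
A minimal sketch of the unicode_literals approach, assuming a UTF-8 encoded file with the coding declaration (the variable name is made up):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

Umlauts = "üöä"        # now a unicode literal even without the u prefix
print repr(Umlauts)    # u'\xfc\xf6\xe4' on Python 2.6/2.7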
guettli
  • Could you elaborate on "makes trouble"? – Jon Clements Sep 28 '12 at 09:33
  • Finding those strings and sticking a `u` in front of them is not going to solve your problem. This error appears whenever you *do* something with your data (like `print`ing) where the accepting function doesn't expect characters encoded that way. You need to make sure that all strings in your program are handled as Unicode as soon and as long as possible and only encoded to specific, matching encodings when exporting/printing etc. – Tim Pietzcker Sep 28 '12 at 09:44
  • First of all I **love** `__future__.unicode_literals`. Second: to find those I would probably try using `grep` as in [this example](http://stackoverflow.com/questions/3001177/how-do-i-grep-for-non-ascii-characters-in-unix). Of course this will also find such characters _outside of a bytestring_, but I assume there aren't many variables with umlaut names, are there? (see the Python sketch after these comments) – javex Sep 28 '12 at 09:45
  • @javex: Good point; it's devilishly hard to match all forms of strings in Python with regexes (think of strings like `"""'"'\""\n'''"""`)... – Tim Pietzcker Sep 28 '12 at 09:51
  • @TimPietzcker: correct, that's why you just search for a specific byte range. That will simply find **any** non-ASCII characters. Then you can change those that need a change. – javex Sep 28 '12 at 09:53
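
For reference, a rough Python equivalent of the byte-range search javex suggests above; it scans the raw bytes of a file, so it flags any non-ASCII byte, not only those inside string literals (the file name is a placeholder):

import re

# Scan the raw bytes of a source file for anything outside the ASCII range.
with open("your_script.py", "rb") as fp:
    for lineno, line in enumerate(fp, 1):
        for match in re.finditer(r'[\x80-\xFF]+', line):
            print 'line %d col %d: non-ascii bytes %r' % (lineno, match.start(), match.group())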

1 Answer


Of course you want to use Python for this!

import ast, re

# Parse the source file and walk every node of the syntax tree.
with open("your_script.py") as fp:
    tree = ast.parse(fp.read())

for node in ast.walk(tree):
    # On Python 2 a literal without the u prefix is a str (bytestring);
    # any byte in the range 0x80-0xFF means it contains non-ASCII data.
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        print 'bad string %r line %d col %d' % (node.s, node.lineno, node.col_offset)

Note that this doesn't distinguish between bare and escaped non-ASCII chars (fuß and fu\xdf).
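
If you do need to tell the two apart, one rough heuristic (a sketch, not part of the original answer) is to also look at the raw source line: a literally typed character leaves high bytes in the line, whereas an escape sequence is plain ASCII in the source:

import ast, re

with open("your_script.py") as fp:
    source = fp.read()
lines = source.splitlines()
tree = ast.parse(source)

for node in ast.walk(tree):
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        # A bare umlaut leaves high bytes in the source line itself;
        # an escaped one (fu\xdf) does not.
        bare = re.search(r'[\x80-\xFF]', lines[node.lineno - 1])
        print '%s non-ascii string %r line %d' % (
            'bare' if bare else 'escaped', node.s, node.lineno)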

georg