
All my Python source code is encoded in UTF-8 and has this coding declared at the top of each file.

But sometimes the u prefix before a unicode string literal is missing.

Example: Umlauts = "üöä"

The line above creates a bytestring containing non-ASCII characters, and this causes trouble (UnicodeDecodeError).

I tried pylint and `python -3`, but I could not get a warning.

I am looking for an automated way to find non-ASCII characters in bytestrings.

My source code needs to support Python 2.6 and Python 2.7.

I get this well-known error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
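
For context, here is a minimal reproduction of how such a bytestring triggers that error (the variable name and the print line are made up for illustration; the file is assumed to be UTF-8 encoded with a coding declaration):

# -*- coding: utf-8 -*-
Umlauts = "üöä"            # bytestring: the u prefix is missing
print u"Hello " + Umlauts  # mixing unicode with a non-ASCII bytestring forces
                           # an implicit ascii decode -> UnicodeDecodeError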

BTW: This question is only about Python source code, not about strings read from files or sockets.

Solution

  • for projects which need to support Python 2.6+ I will use `from __future__ import unicode_literals` (a short sketch follows this list)
  • for projects which need to support Python 2.5 I will use the solution from thg435 (module `ast`)
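
A minimal sketch of the unicode_literals approach, assuming a UTF-8 encoded file with the coding declaration (the variable name is made up):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

Umlauts = "üöä"        # now a unicode literal even without the u prefix
print repr(Umlauts)    # u'\xfc\xf6\xe4' on Python 2.6/2.7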
guettli
  • Could you elaborate on "makes trouble"? – Jon Clements Sep 28 '12 at 09:33
  • Finding those strings and sticking a `u` in front of them is not going to solve your problem. This error appears whenever you *do* something with your data (like `print`ing) where the accepting function doesn't expect characters encoded that way. You need to make sure that all strings in your program are handled as Unicode as soon and as long as possible and only encoded to specific, matching encodings when exporting/printing etc. – Tim Pietzcker Sep 28 '12 at 09:44
  • First of all I **love** `__future__.unicode_literals`. Second: to find those I would probably try using `grep` as in [this example](http://stackoverflow.com/questions/3001177/how-do-i-grep-for-non-ascii-characters-in-unix). Of course this will also find such characters _outside of a bytestring_, but I assume there aren't many variables with umlaut names, are there? (see the Python sketch after these comments) – javex Sep 28 '12 at 09:45
  • @javex: Good point; it's devilishly hard to match all forms of strings in Python with regexes (think of strings like `"""'"'\""\n'''"""`)... – Tim Pietzcker Sep 28 '12 at 09:51
  • @TimPietzcker: correct, that's why you just search for a specific byte range. That will simply find **any** non-ASCII characters. Then you can change those that need a change. – javex Sep 28 '12 at 09:53
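
For reference, a rough Python equivalent of the byte-range search javex suggests above; it scans the raw bytes of a file, so it flags any non-ASCII byte, not only those inside string literals (the file name is a placeholder):

import re

# Scan the raw bytes of a source file for anything outside the ASCII range.
with open("your_script.py", "rb") as fp:
    for lineno, line in enumerate(fp, 1):
        for match in re.finditer(r'[\x80-\xFF]+', line):
            print 'line %d col %d: non-ascii bytes %r' % (lineno, match.start(), match.group())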

1 Answer


Of course you want to use Python for this!

import ast, re

# Parse the source file and walk every node of the syntax tree.
with open("your_script.py") as fp:
    tree = ast.parse(fp.read())

for node in ast.walk(tree):
    # On Python 2 a literal without the u prefix is a str (bytestring);
    # any byte in the range 0x80-0xFF means it contains non-ASCII data.
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        print 'bad string %r line %d col %d' % (node.s, node.lineno, node.col_offset)

Note that this doesn't distinguish between bare and escaped non-ASCII chars (fuß and fu\xdf).
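
If you do need to tell the two apart, one rough heuristic (a sketch, not part of the original answer) is to also look at the raw source line: a literally typed character leaves high bytes in the line, whereas an escape sequence is plain ASCII in the source:

import ast, re

with open("your_script.py") as fp:
    source = fp.read()
lines = source.splitlines()
tree = ast.parse(source)

for node in ast.walk(tree):
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        # A bare umlaut leaves high bytes in the source line itself;
        # an escaped one (fu\xdf) does not.
        bare = re.search(r'[\x80-\xFF]', lines[node.lineno - 1])
        print '%s non-ascii string %r line %d' % (
            'bare' if bare else 'escaped', node.s, node.lineno)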

georg