
While making my App Engine app I suddenly ran into an error that shows up every couple of requests:

    run_wsgi_app(application)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
    run_bare_wsgi_app(add_wsgi_middleware(application))
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 118, in run_bare_wsgi_app
    for data in result:
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/appstats/recording.py", line 897, in appstats_wsgi_wrapper
    result = app(environ, appstats_start_response)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 717, in __call__
    handler.handle_exception(e, self.__debug)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 463, in handle_exception
    self.error(500)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 436, in error
    self.response.clear()
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 288, in clear
    self.out.seek(0)
  File "/usr/lib/python2.7/StringIO.py", line 106, in seek
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 208: ordinal not in range(128)

I really have no idea where this comes from; it only happens when I use one specific function, but it's impossible to track down every string I have. It's possible this byte is a character like ' " [ ] etc., but in another language.

How can I find this byte and possibly other ones?

I am running GAE with Python 2.7 on Ubuntu 11.04.

Thanks.
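(For what it's worth, `0xd7` is a common UTF-8 lead byte for Hebrew letters, so the culprit is likely non-English text sitting in a plain byte string. A minimal sketch of the failure, using the Hebrew letter aleph as an assumed example:)

```python
# 0xd7 0x90 is the UTF-8 encoding of the Hebrew letter aleph;
# decoding those bytes as ASCII fails exactly like the traceback above.
raw = b'\xd7\x90'
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xd7 in position 0: ...
print(raw.decode('utf-8'))  # decoding with the right codec works
```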

*updated*

This is the code I ended up using:

from codecs import BOM_UTF8
from os import listdir, path

p = "path"

def loopPath(p, times=0):
    for fname in listdir(p):
        filePath = path.join(p, fname)
        if path.isdir(filePath):
            # recurse into the subdirectory, then keep scanning this one
            loopPath(filePath, times + 1)
            continue

        if not fname.endswith('.py'):
            continue

        f = open(filePath, 'r')
        ln = 0
        for line in f:
            # skip the UTF-8 byte order mark on the first line (it is 3 bytes)
            if not ln and line[:3] == BOM_UTF8:
                line = line[3:]
            col = 0
            for c in line:
                if ord(c) > 127:
                    raise Exception('Found %r line %d column %d in %s'
                                    % (c, ln + 1, col, filePath))
                col += 1
            ln += 1
        f.close()

loopPath(p)
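For comparison, the same scan can be sketched on Python 3 with `os.walk` (the helper name `find_non_ascii` is made up; the `utf-8-sig` codec strips a leading BOM automatically):

```python
import os

def find_non_ascii(root):
    """Yield (path, line, column, char) for every non-ASCII character
    in .py files under root."""
    for dirpath, _dirs, filenames in os.walk(root):
        for fname in filenames:
            if not fname.endswith('.py'):
                continue
            full = os.path.join(dirpath, fname)
            # 'utf-8-sig' transparently skips a leading BOM; undecodable
            # bytes become U+FFFD, which is still flagged as non-ASCII
            with open(full, encoding='utf-8-sig', errors='replace') as f:
                for ln, text in enumerate(f, 1):
                    for col, ch in enumerate(text):
                        if ord(ch) > 127:
                            yield full, ln, col, ch
```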
Shedokan
    Have you tried looking at byte `'\0xd7'` at position `208` of the buffer as the error obviously pointed out? – Jeff Mercado Sep 17 '11 at 18:05
  • That buffer is an internal variable of the StringIO, which is really deep inside GAE's code, and the buffer won't show me exactly where it is in my code, only a lot of text... – Shedokan Sep 17 '11 at 21:22

5 Answers


Just go through every character in each line of code. Something like this:

# -*- coding: utf-8 -*-
import sys

data = open(sys.argv[1])
line = 0
for l in data:
    line += 1
    char = 0
    for s in list(unicode(l, 'utf-8')):
        char += 1
        try:
            s.encode('ascii')
        except UnicodeEncodeError:
            print 'Non-ASCII character at line:%s char:%s' % (line, char)
Andrey Nikishaev

When I translated UTF-8 files to latin1 for LaTeX I had similar problems. I wanted a list of all the evil unicode characters in my files.

It is probably even more than you need, but I used this:

import sys
import codecs

UNICODE_ERRORS = {}

def fortex(exc):
    import unicodedata, exceptions
    global UNICODE_ERRORS
    if not isinstance(exc, exceptions.UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    l = []
    print >>sys.stderr, "   UNICODE:", repr(exc.object[max(0, exc.start-20):exc.end+20])
    for c in exc.object[exc.start:exc.end]:
        uname = unicodedata.name(c, u"0x%x" % ord(c))
        l.append(uname)
        key = repr(c)
        if not UNICODE_ERRORS.has_key(key): UNICODE_ERRORS[key] = [ 1, uname ]
        else: UNICODE_ERRORS[key][0] += 1
    return (u"\\gpTastatur{%s}" % u", ".join(l), exc.end)

def main():
    codecs.register_error("fortex", fortex)
    ...
    fileout = codecs.open(filepath, 'w', DEFAULT_CHARSET, 'fortex')
    ...
    print UNICODE_ERRORS

helpful?

Here is the matching excerpt from the Python doc:

`codecs.register_error(name, error_handler)` — Register the error handling function *error_handler* under the name *name*. *error_handler* will be called during encoding and decoding in case of an error, when *name* is specified as the errors parameter.

For encoding error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The encoder will encode the replacement and continue encoding the original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.
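A minimal self-contained illustration of that API (the handler name `'report'` and the `'?'` replacement are arbitrary choices, not part of the answer above):

```python
import codecs

def report(exc):
    # per the doc excerpt: only encode errors are handled here
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    bad = exc.object[exc.start:exc.end]
    print('cannot encode %r at position %d' % (bad, exc.start))
    # replace the offending run with '?' and resume after it
    return (u'?' * (exc.end - exc.start), exc.end)

codecs.register_error('report', report)
print(u'caf\xe9'.encode('ascii', 'report'))  # b'caf?' on Python 3
```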

towi

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number, and will highlight non-ASCII chars in red.

Copied from How do I grep for all non-ASCII characters in UNIX. Fredrik's answer is good but not quite right because it also finds ASCII chars that are not alphanumeric.
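If `grep -P` isn't available (BSD grep, for instance), a rough Python equivalent of the same byte-range scan could look like this (`lines_with_high_bytes` is a made-up name):

```python
import re

# same character class as the grep pattern above, applied to raw bytes
HIGH_BYTE = re.compile(b'[\x80-\xff]')

def lines_with_high_bytes(path):
    """Return the 1-based numbers of lines containing bytes outside ASCII."""
    with open(path, 'rb') as f:
        return [n for n, line in enumerate(f, 1) if HIGH_BYTE.search(line)]
```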

Brian

This Python script gives the offending character and its index in the text when that text is viewed as a single line:

[(index, char) for (index, char) in enumerate(open('myfile').read()) if ord(char) > 127]
Brian

This should list the offending lines:

grep -v '[[:alnum:]]' dodgy_file


$ cat test
/home/ubuntu/tmp/SO/c.awk

$ cat test2
/home/ubuntu/tmp/SO/c.awk
な

$ grep -v '[[:alnum:]]' test

$ grep -v '[[:alnum:]]' test2
な
Fredrik Pihl
  • @shedokan - care to explain why? cmd-line can be your greatest friend! It really pays off to learn to use it. Learning some standard linux tools like grep, sed, awk, sort, uniq and connecting them using pipes gives you a tool-set that is unrivaled by any GUI-program in the world! – Fredrik Pihl Sep 17 '11 at 21:21
  • I'm a windows guy and used to not having to see any white on black text :) I'm only using linux for GAE nothing more. Even though they can be powerful I don't need them so no point in learning them, even though they can sometimes be useful – Shedokan Sep 17 '11 at 21:27
  • but you do need them, even on Windows! Install [cygwin](http://www.cygwin.com/); they are **VERY** useful, so please reconsider your path to technical enlightenment. More tools in your utility belt means you are better equipped to handle new challenges. Just my 10c – Fredrik Pihl Sep 17 '11 at 21:47