Below I use translate() to strip punctuation from a string. I've had a lot of trouble with translate() because it does not work with unicode strings. Now I've noticed that the script works on the development server but raises an error on the production server.

The request is sent by a Chrome extension to Google App Engine. How can I fix this so that the same script works on the production server? Or is there another way of removing punctuation without using translate()?

Logs in the production server:

2011-10-11 06:18:10.384
get_rid_of_unicode: ajax: how to use xmlhttprequest
E 2011-10-11 06:18:10.384
expected a character buffer object
Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 703, in __call__
    handler.post(*groups)
  File "/base/data/home/apps/ting-1/1.353888928453510037/ting.py", line 2073, in post
    user_tag_list_case = f1.striplist(main().split(" "))
  File "/base/data/home/apps/ting-1/1.353888928453510037/ting.py", line 2055, in main
    title_no_punctuation = get_rid_of_unicode.translate(None, string.punctuation)
TypeError: expected a character buffer object

Same script works with no problem in the development server:

INFO 2011-10-11 13:15:49,154 ting.py:2052] get_rid_of_unicode: how to use xmlhttprequest  
INFO 2011-10-11 13:15:49,154 ting.py:2057] title_no_punctuation: how to use xmlhttprequest

The script:

def main():

    title_lowercase = title.lower()
    title_without_possessives = remove_possessive(title_lowercase)
    title_without_double_quotes = remove_double_quotes(title_without_possessives)
    get_rid_of_unicode = title_without_double_quotes.encode('utf-8')
    title_no_punctuation = get_rid_of_unicode.translate(None, string.punctuation)
    back_to_unicode = unicode(title_no_punctuation, "utf-8")
    clean_title = remove_stop_words(back_to_unicode, f1.stop_words)
    return clean_title

user_tag_list = []
user_tag_list_case = f1.striplist(main().split(" "))
for tag in user_tag_list_case:
    user_tag_list.append(tag.lower())
Zeynel

1 Answer

Google App Engine runs Python 2.5.2. str.translate() requires a 256-character string as the first argument; None has only been an allowed value since Python 2.6.
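
One way to sidestep the missing `None` support is sketched below, assuming the title is kept as a unicode string instead of being encoded to UTF-8 first. `unicode.translate()` takes a mapping from code points to replacements, and a value of `None` deletes the character. The names `PUNCTUATION_TABLE` and `strip_punctuation` are made up for illustration, and this has not been checked on the App Engine runtime itself.

```python
import string

# Map each ASCII punctuation code point to None; unicode.translate()
# treats a None value as "delete this character".
PUNCTUATION_TABLE = dict((ord(ch), None) for ch in string.punctuation)


def strip_punctuation(text):
    """Return text (a unicode string) with ASCII punctuation removed.

    Hypothetical helper: it assumes the caller passes unicode, not a
    UTF-8 encoded byte string.
    """
    return text.translate(PUNCTUATION_TABLE)


print(strip_punctuation(u"ajax: how to use xmlhttprequest"))
```

This only removes the characters listed in `string.punctuation`, so a non-ASCII symbol such as `u'\xbb'` passes through unchanged, and no encode/decode round trip is needed.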

Wooble
  • @Wooble: Thanks. I tried the approach from http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings/1324274#1324274 without `None`, but it raises an `AssertionError` on the line `assert isinstance(to_translate, str)`. The same code works fine in IDLE, so I assume this is another issue with GAE running 2.5.2. Any suggestions for eliminating non-letters and non-digits in a way that is compatible with the current GAE version? Thanks again. – Zeynel Oct 11 '11 at 14:22
  • You can use [maketrans](http://docs.python.org/library/string.html#string.maketrans) to create the translation table you need to pass to translate. In your case you'd need to enumerate all the non-letters and non-digits and map them to the space character (if I understand what you're trying to do correctly). A regexp might be easier. – Luke Francl Oct 11 '11 at 15:42
  • @Luke Francl: I tried `maketrans` as in this answer http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings/1324274#1324274 but got an `AssertionError`, and with this one http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings/1324461#1324461 a character like `u'»'` raises `TypeError: decoding Unicode is not supported`. – Zeynel Oct 11 '11 at 16:39
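
The regexp route mentioned in the comments could look roughly like the sketch below. It assumes the input is already a unicode string and that runs of non-letters and non-digits should collapse to a single space. `NON_ALNUM` and `keep_letters_and_digits` are hypothetical names, and the pattern is compiled up front because `re.sub()` only gained a `flags` argument in later Python versions.

```python
import re

# \W matches anything that is not a letter, digit, or underscore;
# adding "_" to the class makes underscores count as separators too.
# re.UNICODE makes \W respect non-ASCII letters on Python 2.
NON_ALNUM = re.compile(r"[\W_]+", re.UNICODE)


def keep_letters_and_digits(text):
    """Replace every run of non-letters/non-digits in text with one space.

    Hypothetical helper: it expects a unicode string and trims leading
    and trailing whitespace from the result.
    """
    return u" ".join(NON_ALNUM.sub(u" ", text).split())


print(keep_letters_and_digits(u"ajax: how to use xmlhttprequest \u00bb"))
```

Unlike a table built from `string.punctuation`, this also drops non-ASCII symbols such as `u'\xbb'`, while accented letters are kept.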