I need to remove punctuation from a unicode string. I've read a few posts and the most recommended one was this one.

I've implemented the following:

table = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))

def tokenize(message):
    message = unicode(message,'utf-8').lower()
    #print message
    message = remove_punctuation_unicode(message)
    return message

def remove_punctuation_unicode(string):
    return string.translate(table)

But when I run the code, this error pops up:

table = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
TypeError: must be unicode, not str

I can't quite figure out what to do. Can someone tell me how to fix this?

Krishh

1 Answer

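Use unichr instead of chr. In Python 2, chr(i) returns a byte str (and only accepts values 0 to 255), while unicodedata.category requires a unicode argument, which is why you get the TypeError. unichr(i) returns the unicode character for code point i: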

Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, unicodedata
>>> table = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(unichr(i)).startswith('P'))
>>> 
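
For completeness, here is a minimal sketch of how the whole thing might look with the fix applied. The function names come from the question; the `try`/`except` shim for `unichr` and the `isinstance(message, bytes)` check are my own additions (assumptions, not part of the original code) so that the same snippet also runs on Python 3, where `chr` already returns unicode text.

```python
import sys
import unicodedata

try:
    unichr  # Python 2: builtin that returns a unicode character
except NameError:
    unichr = chr  # Python 3 (assumption): chr() already returns text

# Map every code point whose Unicode category starts with 'P'
# (punctuation) to None; translate() deletes characters mapped to None.
table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(unichr(i)).startswith('P')
)

def remove_punctuation_unicode(text):
    # text must be unicode text, not a byte string
    return text.translate(table)

def tokenize(message):
    # Decode UTF-8 byte strings; leave already-decoded text unchanged
    if isinstance(message, bytes):
        message = message.decode('utf-8')
    return remove_punctuation_unicode(message.lower())
```

Note that this only strips characters in the punctuation categories. Symbols such as `$` or `+` belong to the `S` categories and are kept.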
Yurim
  • @KrishanuKonar why did you use `chr` if [the post that you've linked](http://stackoverflow.com/a/11066687/5374161) uses `unichr`? – jfs Apr 16 '16 at 13:06
  • honestly, I read it once, understood it and typed it myself, and I definitely typed it wrong, stupid mistake. my bad. – Krishh Apr 16 '16 at 13:51