I have a question about string.punctuation.
I'm using NLTK and I need to strip the punctuation from my text (the text is already split into tokens with word_tokenize(my_str)).
I wrote simple functions to do the work, but after calling them I see that the double-quote tokens remain! The others, like commas, full stops, and the other special characters, are removed correctly, but not the double quotes. Why? If I print string.punctuation in the Python interpreter I see the characters considered punctuation, and the double quote is among them:
>>> import string
>>> print string.punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
My functions are these:
def is_punct_char(char):
    return char in string.punctuation

def is_not_punct_char(char):
    return not is_punct_char(char)

# erase_punct:
# \par token_list: list of tokens
# \return: list of tokens without punctuation
def erase_punct(token_list):
    return filter(is_not_punct_char, token_list)
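To debug this, I checked the tokens directly against string.punctuation in the interpreter (Python 3 print syntax here; the token values are the ones my tokenizer produced). Note that `in` on a string is a substring test:

```python
import string

# Single-character tokens are found in string.punctuation:
print(',' in string.punctuation)   # True
print('"' in string.punctuation)   # True

# The quote tokens from the tokenizer are TWO characters long,
# and `in` on a string checks for a substring, so a doubled
# backtick or apostrophe is not found in string.punctuation:
print('``' in string.punctuation)  # False
print("''" in string.punctuation)  # False
```

So is_punct_char returns False for these two tokens, which matches what I observe, but I don't understand where the `` and '' tokens come from in the first place.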
The original text is:
Hello, how are you? I'm ok, thanks. And you? Not very "well".
After tokenization the output is:
[u'Hello', u',', u'how', u'are', u'you', u'?', u'I', u"'m", u'ok', u',', u'thanks', u'.', u'And', u'you', u'?', u'Not', u'very', u'``', u'well', u"''", u'.']
After clear from punctuation the output is:
[u'Hello', u'how', u'are', u'you', u'I', u"'m", u'ok', u'thanks', u'And', u'you', u'Not', u'very', u'``', u'well', u"''"]
That is not correct: as the last token I expected u'well', not the two quote tokens around it (u'``' and u"''").
Could anyone help me?