
I have a question about string.punctuation.

I'm using NLTK and I need to remove punctuation from my text (the text is already split into tokens with the function word_tokenize(my_str)).

I wrote some simple functions to do the job, but after calling them I see that the double-quote tokens remain! The other tokens, such as commas, full stops and other special characters, are removed correctly, but not the double quotes. Why? If I print string.punctuation in the Python interpreter, I get the list of characters considered punctuation, and the double quote is among them:

 >>> import string
 >>> print string.punctuation
 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

These are my functions:

import string

def is_punct_char(char):
    return char in string.punctuation

def is_not_punct_char(char):
    return not is_punct_char(char)

# clear punct:
#\par:    list of tokens
#\return: list of (token, PoS) bigrams
def erase_punct(token_list):
    return filter(is_not_punct_char, token_list)

The original text is:

Hello, how are you? I'm ok, thanks. And you? Not very "well".

After tokenization the output is:

[u'Hello', u',', u'how', u'are', u'you', u'?', u'I', u"'m", u'ok', u',', u'thanks', u'.', u'And', u'you', u'?', u'Not', u'very', u'``', u'well', u"''", u'.']

After removing the punctuation, the output is:

[u'Hello', u'how', u'are', u'you', u'I', u"'m", u'ok', u'thanks', u'And', u'you', u'Not', u'very', u'``', u'well', u"''"]

That is not correct. I expected the last token to be u'well', without the two quote tokens around it (u'``' and u"''").

Can anyone help me?

Kyrol
  • And as to why your method doesn't work: your test only works for *single character strings*, but your two remaining tokens contain two characters each. – Martijn Pieters Aug 26 '14 at 13:59
  • So you are telling me that `"` is treated as two characters? – Kyrol Aug 26 '14 at 14:01
  • Yes, it is two `'` single quote characters. `len(u"''")` -> 2. – Martijn Pieters Aug 26 '14 at 14:05
  • Then I NEED to keep the punctuation tokens at first and delete them only afterwards, because I have to pass the total number of tokens (punctuation tokens included) to another function. Only after passing that count do I delete the punctuation tokens. So, in my opinion, this question is not exactly the same as the other one you pointed me to. – Kyrol Aug 26 '14 at 14:06
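
Following up on the comments above, here is a minimal sketch of one possible fix. It is not the asker's code, only an illustration under stated assumptions. Because `char in string.punctuation` is a substring test, a two-character token such as the double backtick or the double apostrophe never matches. Checking that every character of a token is punctuation handles those tokens too. The names `is_punct_token` and `n_tokens` are hypothetical, and the token list is copied from the question without the `u` prefixes so that it runs on both Python 2 and Python 3.

```python
import string

def is_punct_token(token):
    # True only if the token is non-empty and EVERY character is punctuation,
    # so multi-character tokens like '``' and "''" are caught as well.
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

def erase_punct(token_list):
    # A list comprehension returns a list on both Python 2 and Python 3
    # (filter() returns a lazy iterator on Python 3).
    return [tok for tok in token_list if not is_punct_token(tok)]

# Tokens from the question (assumed to be the word_tokenize output).
tokens = ['Hello', ',', 'how', 'are', 'you', '?', 'I', "'m", 'ok', ',',
          'thanks', '.', 'And', 'you', '?', 'Not', 'very', '``', 'well',
          "''", '.']

# Count the tokens first (punctuation included), then remove punctuation.
n_tokens = len(tokens)
cleaned = erase_punct(tokens)
```

With this input, `n_tokens` is 21 and `cleaned` no longer contains the two quote tokens, while `'m` is kept because it contains a letter.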
