I am using the tokenizer from NLTK in Python.
There are a whole bunch of answers on the forum already about removing punctuation. However, none of them address all of the following issues together:
- More than one symbol in a row. For example, take the sentence: He said,"that's it." Because the comma is followed directly by a quotation mark, the tokenizer keeps the two together as one token, and it does not remove the ." at the end of the sentence either. The tokenizer gives ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He','said', 'that', 's', 'it']. Other examples include '...', '--', '!?', ',"', and so on.
- A symbol at the end of the sentence. For example, take the sentence: Hello World. The tokenizer gives ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Other examples include '--' and ',' at the beginning, middle, or end of any word.
- Symbols directly before and after a character, for example '*u*', '''', '""'.
Is there an elegant way of solving all of these problems?
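
To make the expected output concrete, here is a minimal sketch of the behaviour I am after. It uses Python's built-in re module rather than the NLTK tokenizer, and the function name tokenize_words is just a placeholder I made up for illustration. It assumes that every run of letters, digits, or underscores counts as a token and everything else is dropped, which may well be too aggressive (for example, it splits "that's" into 'that' and 's', and it keeps '_').

```python
import re

# Hypothetical helper, not part of NLTK: keep only runs of "word" characters.
# \w matches letters, digits, and the underscore, so '_' is NOT removed.
WORD_PATTERN = re.compile(r"\w+")


def tokenize_words(text):
    """Return the word-like tokens in text, dropping all other symbols."""
    return WORD_PATTERN.findall(text)


if __name__ == "__main__":
    for sample in ['He said,"that\'s it."', "Hello World.", "*u*", "... -- !?"]:
        print(repr(sample), "->", tokenize_words(sample))
```

With this sketch, the first sentence gives ['He', 'said', 'that', 's', 'it'], 'Hello World.' gives ['Hello', 'World'], and '*u*' gives ['u']. I would prefer a solution that stays within NLTK if there is a clean one.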