How to remove punctuation?

Question

I am using the tokenizer from NLTK in Python.

There are a whole bunch of answers about removing punctuation on the forum already. However, none of them addresses all of the following issues together:

More than one symbol in a row. For example, take the sentence: He said,"that's it." Because the comma is immediately followed by a quotation mark, and the period by another one, the tokenizer won't remove these runs of symbols. It gives ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Other examples include '...', '--', '!?', ',"', and so on.
A symbol at the end of a sentence. For example, for the sentence: Hello World. the tokenizer gives ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Other examples include '--' and ',' at the beginning, middle, or end of any word.
Characters with symbols before and after them, e.g. '*u*', '''', '""'.

Is there an elegant way of solving all of these problems?

What difficulties do you have in implementing these requirements? What issues do you have with your current version of the code? — jfs, Apr 27 '14 at 01:47
btw, there are many questions that have answers that satisfy all requirements e.g., [Remove punctuation from Unicode formatted strings](http://stackoverflow.com/q/11066400/4279) — jfs, Apr 27 '14 at 01:50
How did the answers from [Best way to strip punctuation from a string in Python](http://stackoverflow.com/q/265960/4279) fail you? — jfs, Apr 27 '14 at 01:52
@J.F. Sebastian These are great links that will undoubtedly help the OP, but I think the OP is getting hung up on the third requirement. The OP states that the code should remove characters with "symbols" before or after the characters and gives as an example `'*u*'`. Thus the `u` should also go in this instance. — Justin O Barber, Apr 27 '14 at 03:01
@JustinBarber: if `u` should be removed then it is not about just removing punctuation anymore. There are two distinct tasks: remove `*u*` and remove punctuation. — jfs, Apr 27 '14 at 03:12
@J.F. Sebastian I think that's right. The title suggests that only punctuation is in view, but the language of the third requirement suggests that some characters will need to be removed. — Justin O Barber, Apr 27 '14 at 03:22
@JustinBarber: yes, `u` is just an example in my comment also. — jfs, Apr 27 '14 at 03:24
@OP, can you give an example for `*u*`? I don't think that counts as punctuation removal, though. — alvas, Apr 27 '14 at 09:06

score 13 · Answer 1 · answered Apr 27 '14 at 01:46

Solution 1: Tokenize and strip punctuation off the tokens

>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
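
A variation on Solution 1, as a minimal sketch: stripping punctuation from every token first and then dropping the tokens that end up empty avoids having to list multi-character tokens such as `''` by hand. The helper name `clean_tokens` is hypothetical, and the token list is copied from the `word_tokenize` output above rather than recomputed, so the snippet runs without NLTK installed.

```python
import string

def clean_tokens(tokens):
    """Strip leading/trailing ASCII punctuation and drop tokens left empty."""
    stripped = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in stripped if tok]

# Token list copied from the word_tokenize(sent) output shown above.
tokens = ['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
print(clean_tokens(tokens))  # ['He', 'said', 'that', 's', 'it']
```

Tokens made up only of punctuation (such as `...`, `--`, or `!?`) strip down to the empty string and disappear as well. Punctuation in the middle of a token is left alone, since `str.strip` only touches the ends.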

Solution 2: remove punctuation then tokenize

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
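
The same idea can be sketched as a small reusable function. This assumes Python 3 (for `str.maketrans`) and ASCII punctuation only; the name `strip_punct_tokens` is made up for illustration and is not part of NLTK.

```python
import string

# Translation table mapping each ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punct_tokens(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(_PUNCT_TO_SPACE).split()

print(strip_punct_tokens('''He said,"that's it."'''))  # ['He', 'said', 'that', 's', 'it']
print(strip_punct_tokens('Hello World.'))              # ['Hello', 'World']
```

Because every punctuation character becomes a separator, this also splits hyphenated words and contractions, which may or may not be what you want.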

I like this approach, but I think the OP is getting hung up on the third requirement. The OP states that the code should remove characters with "symbols" before or after the characters and gives as an example `'*u*'`. Thus a `u` in such a context should be removed (probably while the asterisks still signify that the character `u` needs to go). — Justin O Barber, Apr 27 '14 at 03:04

Justin O Barber · Accepted Answer · 2014-04-27T02:42:12.597

If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.

One way to go about this, then, is to tokenize on gaps like so:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return a token for a single character wrapped in punctuation, such as "A". Furthermore, the regex only removes a single character when punctuation sits directly on both sides of it. Otherwise, "Go." would not return a token. You may need to adjust the regex in other ways, depending on what your data looks like and what your expectations are.
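
To experiment with the pattern without NLTK, the gap-splitting step can be approximated with the standard `re` module. This is only a sketch: the function name `gap_tokenize` is hypothetical, and the pattern is adapted from the one above with a non-capturing group, so that `re.split` returns only the text between gaps.

```python
import re

# A "gap" is a run of non-word characters and/or single word characters
# that have punctuation directly on both sides (e.g. the u in *u*).
GAP = re.compile(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+')

def gap_tokenize(text):
    """Split text on gaps and drop the empty strings left at the edges."""
    return [tok for tok in GAP.split(text) if tok]

s = '''He said,"that's it." *u* Hello, World.'''
print(gap_tokenize(s))      # ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
print(gap_tokenize('"A"'))  # []  (a single character wrapped in punctuation is dropped)
print(gap_tokenize('Go.'))  # ['Go']
```

The last two calls illustrate the caveats mentioned above: a quoted single letter vanishes, while a word that is merely followed by punctuation survives.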

Sorry, I clicked the check mark, but somehow it didn't go through. — user3534472, May 03 '14 at 02:49
