I am using the tokenizer from NLTK in Python.

There are already plenty of answers on the forum about removing punctuation. However, none of them addresses all of the following issues together:

  1. More than one symbol in a row. For example, take the sentence: He said,"that's it." Because a comma is followed directly by a quotation mark, the tokenizer does not remove the punctuation, including the ." at the end of the sentence. The tokenizer gives ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Other examples include '...', '--', '!?', and ',"'.
  2. Symbols at the end of a sentence. For example, take the sentence: Hello World. The tokenizer gives ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Other examples include '--' and ',' at the beginning, middle, or end of a word.
  3. Characters with symbols before and after them should be removed, e.g. '*u*', '''', '""'.

Is there an elegant way of solving all three problems?

Salvador Dali
user3534472
  • What difficulties do you have in implementing these requirements? What issues do you have with your current version of the code? – jfs Apr 27 '14 at 01:47
  • btw, there are many questions that have answers that satisfy all requirements e.g., [Remove punctuation from Unicode formatted strings](http://stackoverflow.com/q/11066400/4279) – jfs Apr 27 '14 at 01:50
  • How did the answers from [Best way to strip punctuation from a string in Python](http://stackoverflow.com/q/265960/4279) fail you? – jfs Apr 27 '14 at 01:52
  • @J.F. Sebastian These are great links that will undoubtedly help the OP, but I think the OP is getting hung up on the third requirement. The OP states that the code should remove characters with "symbols" before or after the characters and gives as an example `'*u*'`. Thus the `u` should also go in this instance. – Justin O Barber Apr 27 '14 at 03:01
  • @JustinBarber: if `u` should be removed then it is not about just removing punctuation anymore. There are two distinct tasks: remove `*u*` and remove punctuation. – jfs Apr 27 '14 at 03:12
  • @J.F. Sebastian I think that's right. The title suggests that only punctuation is in view, but the language of the third requirement suggests that some characters will need to be removed. – Justin O Barber Apr 27 '14 at 03:22
  • @JustinBarber: yes, `u` is just an example in my comment also. – jfs Apr 27 '14 at 03:24
  • @J.F. Sebastian Sorry about that. Yes, that makes sense. – Justin O Barber Apr 27 '14 at 03:27
  • @OP, can you give an example for `*u*`? I don't think that counts as punctuation removal, though. – alvas Apr 27 '14 at 09:06

2 Answers

13

Solution 1: Tokenize and strip punctuation off the tokens

>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
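
The same filtering can be wrapped in a small helper that works on any token list. This is only a sketch: `strip_punct_tokens` is a hypothetical name, not an NLTK function, and the sample token list is copied from the `word_tokenize` output above so that the snippet runs without NLTK installed.

```python
import string

# Characters to strip from token edges; str.strip treats this string as a set.
PUNCT_CHARS = string.punctuation


def strip_punct_tokens(tokens):
    """Strip punctuation from both ends of each token and drop empty results.

    Hypothetical helper for illustration; not part of NLTK.
    """
    stripped = (tok.strip(PUNCT_CHARS) for tok in tokens)
    return [tok for tok in stripped if tok]


# Token list copied from the word_tokenize output shown above.
tokens = ['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
print(strip_punct_tokens(tokens))
# ['He', 'said', 'that', 's', 'it']
```

Because tokens made up entirely of punctuation become empty after stripping, runs such as '...', '--' and '!?' disappear without having to be listed one by one. Punctuation inside a token is left alone, since `str.strip` only touches the edges.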

Solution 2: Remove punctuation, then tokenize

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
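
A variant of the same idea, assuming Python 3, uses `str.translate` instead of a character-by-character comprehension. The function name `tokenize_without_punct` is made up for this sketch.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize_without_punct(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(PUNCT_TO_SPACE).split()


sent = '''He said,"that's it."'''
print(tokenize_without_punct(sent))
# ['He', 'said', 'that', 's', 'it']
print(tokenize_without_punct("Hello World."))
# ['Hello', 'World']
```

Note that `string.punctuation` only covers ASCII, so typographic quotes or other Unicode punctuation would pass through unchanged with either version.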
alvas
  • I like this approach, but I think the OP is getting hung up on the third requirement. The OP states that the code should remove characters with "symbols" before or after the characters and gives as an example `'*u*'`. Thus a `u` in such a context should be removed (probably while the asterisks still signify that the character `u` needs to go). – Justin O Barber Apr 27 '14 at 03:04
6

If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.

One way to go about this, then, is to tokenize on gaps like so:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+', gaps=True)  # non-capturing groups, as the RegexpTokenizer docs require
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return a single letter wrapped in punctuation, such as "A" in quotation marks. Furthermore, the pattern only removes single letters that have punctuation immediately before and after them. Otherwise, "Go." would not return a token. You may need to refine the regex in other ways, depending on what your data looks like and what your expectations are.
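
To experiment with the pattern without NLTK, the gap-splitting behaviour can be approximated with the standard `re` module. This is a rough sketch rather than what `RegexpTokenizer` does internally, and `gap_tokenize` is a hypothetical name. Non-capturing groups are used so that `re.split` returns only the text between gaps.

```python
import re

# One or more of: a single word character flanked by punctuation on both
# sides, or any non-word character. Non-capturing groups keep re.split
# from returning the separators themselves.
GAP = re.compile(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+')


def gap_tokenize(text):
    """Split the text on gaps and discard empty strings."""
    return [tok for tok in GAP.split(text) if tok]


print(gap_tokenize('''He said,"that's it." *u* Hello, World.'''))
# ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
print(gap_tokenize('"A"'))  # [] because A is flanked by quotation marks
print(gap_tokenize('Go.'))  # ['Go']
```

The last two calls illustrate the caveats mentioned above: a quoted single letter is swallowed by the gap, while a longer word followed by punctuation survives.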

Justin O Barber