I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: dots get attached to the last word.

- Why don't you remove the punctuation yourself? `nltk.word_tokenize(the_text.translate(None, string.punctuation))` should work in Python 2, while in Python 3 you can do `nltk.word_tokenize(the_text.translate(dict.fromkeys(map(ord, string.punctuation))))`. – Bakuriu Mar 21 '13 at 12:39
- This doesn't work. Nothing happens with the text. – lizarisk Mar 21 '13 at 12:44
- The workflow assumed by NLTK is that you first tokenize into sentences and then tokenize every sentence into words. That is why `word_tokenize()` does not work with multiple sentences. To get rid of the punctuation, you can use a regular expression or Python's `isalnum()` function. – Suzana Mar 21 '13 at 12:50
- It *does* work: `>>> 'with dot.'.translate(None, string.punctuation)` gives `'with dot'` (note no dot at the end of the result). It may cause problems if you have things like `'end of sentence.No space'`, in which case do this instead: `the_text.translate(string.maketrans(string.punctuation, ' '*len(string.punctuation)))`, which replaces all punctuation with whitespace. – Bakuriu Mar 21 '13 at 12:50
- Oops, it works indeed, but not with Unicode strings. – lizarisk Mar 21 '13 at 13:00
- By the way, the `isalnum()` method works with Unicode. – lizarisk Mar 21 '13 at 13:02
- Try this: http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – Tom Ron Mar 21 '13 at 13:40
- @lizarisk With "python2" I meant Python 2's `str`, which is `bytes` in Python 3. If you use what I wrote as the "python3" version it works: `the_text.translate(dict.fromkeys(map(ord, string.punctuation)))` removes all the (ASCII) punctuation. – Bakuriu Mar 21 '13 at 15:57
- "Why don't you remove the punctuation yourself?" If there is a "correct" way of doing something, it has likely considered edge cases that you are not aware of. – Att Righ Sep 18 '22 at 18:09
12 Answers
Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
- Note that if you use this option, you lose natural-language features special to `word_tokenize` like splitting apart contractions. You can naively split on the regex `\w+` without any need for NLTK (see the sketch after these comments). – sffc Jul 08 '15 at 20:31
- To illustrate @sffc's comment, you might lose words such as "Mr." – finiteautomata Oct 10 '18 at 02:51
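As a rough illustration of the naive `\w+` split @sffc mentions (plain Python, no NLTK; the variable names are just for the example):
import re
text = 'Eighty-seven miles to go, yet. Onward!'
# every maximal run of word characters (letters, digits, underscore) becomes a token
print(re.findall(r'\w+', text))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']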
You do not really need NLTK to remove punctuation. You can remove it with plain Python. For strings (Python 2 `str`):
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)  # Python 2 only
Or for unicode:
import string
# map every punctuation code point to None
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of characters that can be removed (like digits).
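Note that the str/unicode split above is a Python 2 distinction. In Python 3, str.translate takes a single mapping of code points, so a minimal sketch would be:
import string
s = '... some string with punctuation ...'
# str.maketrans('', '', chars) maps every character in chars to None
s = s.translate(str.maketrans('', '', string.punctuation))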

- Removing all punctuation with a list comprehension also works: `a = "*fa,fd.1lk#$"; print("".join([w for w in a if w not in string.punctuation]))` – Johnny Dec 17 '18 at 13:06
- This approach no longer works in Python >= 3.1, as the `translate` method only takes exactly one argument. Please refer to [this question](https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate) if you still want to work with the `translate` method. – RandomWalker Aug 27 '21 at 19:11
The code below removes all punctuation marks as well as non-alphabetic characters. Adapted from the NLTK book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"
words = nltk.word_tokenize(s)
words=[word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

- Just be aware that using this method you will lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying a sentence. It is better to use `sentence.translate(str.maketrans('', '', chars_to_remove))`, where `chars_to_remove` can be `.,':;!?`. – MikeL Feb 27 '17 at 11:24
- @MikeL You can get around words like "can't" and "don't" by importing the `contractions` package and calling `contractions.fix(sentence_here)` before tokenizing. It will turn "can't" into "cannot" and "don't" into "do not". – zipline86 May 07 '19 at 20:48
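Following @zipline86's suggestion, a minimal sketch assuming the third-party contractions package (pip install contractions) is available:
import contractions
import nltk
s = "I can't do this now, because I'm so tired."
expanded = contractions.fix(s)  # roughly: "I cannot do this now, because I am so tired."
words = [w.lower() for w in nltk.word_tokenize(expanded) if w.isalpha()]
# 'cannot' and 'am' survive instead of being dropped as "n't" / "'m" fragments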
As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can filter out the punctuation with filter(). And if you have Unicode strings, make sure they are unicode objects (not 'str' objects encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)  # Python 2 syntax
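A minimal Python 3 equivalent (there print is a function and filter returns an iterator, so a list comprehension is more direct):
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# drop tokens that are just ',' or '-'
print([word for word in tokens if word not in ',-'])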

- Most of the complexity involved in the Penn Treebank tokenizer has to do with the proper handling of punctuation. Why use an expensive tokenizer that handles punctuation well if you're only going to strip out the punctuation? – rmalouf Mar 24 '13 at 22:33
- `word_tokenize` is a function that returns `[token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]`. So I think that your answer is doing what nltk already does: using `sent_tokenize()` before `word_tokenize()`. At least this is so for nltk3. – Kurt Bourbaki Jun 28 '15 at 11:27
- @rmalouf because you don't need punctuation-only tokens? So you want `did` and `n't` but not `.` – Ciprian Tomoiagă Dec 10 '16 at 00:30
I just used the following code, which removed all the punctuation (wordpunct_tokenize splits punctuation into separate tokens, and the isalpha() filter then drops them):
import nltk
tokens = nltk.wordpunct_tokenize(raw)  # 'raw' is your input text
type(tokens)  # list
text = nltk.Text(tokens)
type(text)  # nltk.text.Text
words = [w.lower() for w in text if w.isalpha()]  # keep alphabetic tokens only

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as `can't` will be destroyed into pieces (such as `can` and `t`) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then, if you wish, you can replace certain tokens such as `'m` with `am` (see the sketch below).
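For instance, a small sketch of that replacement step (the mapping here is illustrative, not exhaustive):
tokens = ['I', "'m", 'a', 'southern', 'salesman']
replacements = {"'m": 'am', "n't": 'not', "'re": 'are'}  # extend as needed
tokens = [replacements.get(tok, tok) for tok in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']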

I think you need some sort of regular expression matching (the following code is in Python 3):
import string
import re
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
Output:
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
This should work well in most cases since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as `wordpunct_tokenize`.

- This will also remove things like `...` and `--` while preserving contractions, which `s.translate(None, string.punctuation)` won't. – C.J. Jackson Oct 03 '18 at 19:34
You can do it in one line without nltk (Python 3.x):
import string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
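For example (the sample string is just for illustration):
import string
s = "Hello, world... it's me!"
print(s.translate(str.maketrans('', '', string.punctuation)))
# Hello world its me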

I use this code to remove punctuation:
import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
And if you want to check whether a token is a valid English word or not, you may need PyEnchant.
Tutorial:
import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")
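A small sketch combining the two, filtering tokens through the dictionary check (assuming the en_US dictionary is installed):
import enchant
d = enchant.Dict("en_US")
words = ['hh', 'wo', 'shi', 'a', 'fdffdf']
english = [w for w in words if d.check(w)]  # keep only tokens recognised as English words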

- Beware that this solution kills contractions. That is because `word_tokenize` uses the standard tokenizer, `TreebankWordTokenizer`, which splits contractions (e.g. `can't` into `ca` and `n't`). However `n't` is not alphanumeric and gets lost in the process. – Diego Ferri Jan 21 '18 at 17:55
Just adding to the solution by @rmalouf: this will not include any numbers, because `\w+` is equivalent to `[a-zA-Z0-9_]+`. Matching runs of letters only leaves the digits out:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Remove punctuation (it will remove `.` as well, as part of the punctuation handling) using the code below:
import sys, unicodedata
from nltk.tokenize import word_tokenize
# map every Unicode code point in a punctuation category ('P...') to None
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer has punctuation
w = word_tokenize(text_string)  # now tokenize the string
Sample Input/Output:
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

Since `from string import punctuation` merely provides the string variable `punctuation` containing these special characters...
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
...it can be tailored, e.g. by removing the single quote to leave apostrophes in place, such as in the word `it's`. You can assign your own. Here I'm changing `punctuation` to `punctuations` with an added 's', and it can be plugged into some of the other answers.
punctuations = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~' # \' removed
text = " It'll be ok-ish!?? "
text = ' '.join(filter(None, (word.strip(punctuations) for word in text.split())))
...where `text` becomes:
"It'll be ok-ish"
