
I'm tokenizing text with NLTK, just sentences fed to `wordpunct_tokenize`. This splits contractions (e.g. "don't" becomes 'don' + "'" + 't'), but I want to keep them as one word. I'm refining my methods for a more measured and precise tokenization of text, so I need to dig deeper into the nltk tokenization module beyond simple tokenization.
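
Here's a minimal illustration of the split I'm seeing (using `wordpunct_tokenize` from `nltk.tokenize`):

```python
from nltk.tokenize import wordpunct_tokenize

# The apostrophe is treated as punctuation, so the contraction breaks apart:
print(wordpunct_tokenize("why don't you?"))
# ['why', 'don', "'", 't', 'you', '?']
```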

I'm guessing this is common, and I'd like feedback from others who may have had to deal with this particular issue before.

edit:

Yeah, this is a general, scattershot question, I know.

Also, as a novice to NLP, do I need to worry about contractions at all?

EDIT:

The `SExprTokenizer` or `TreebankWordTokenizer` seems to do what I'm looking for, for now.
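
For reference, `TreebankWordTokenizer` doesn't keep the contraction whole, but it does split it into linguistically meaningful pieces ('do' + "n't") instead of breaking at the apostrophe:

```python
from nltk.tokenize import TreebankWordTokenizer

# Treebank-style splitting: "don't" -> "do" + "n't"
print(TreebankWordTokenizer().tokenize("why don't you?"))
# ['why', 'do', "n't", 'you', '?']
```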

blueblank

4 Answers


Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
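
If you'd rather compare locally than on the demo page, here's a small sketch that runs one sentence through a few of the stock tokenizers (output shown as comments):

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

sentence = "why don't you?"

# Each tokenizer makes a different trade-off on contractions and punctuation:
for tokenizer in (WhitespaceTokenizer(), WordPunctTokenizer(), TreebankWordTokenizer()):
    print(f"{type(tokenizer).__name__}: {tokenizer.tokenize(sentence)}")

# WhitespaceTokenizer: ['why', "don't", 'you?']
# WordPunctTokenizer: ['why', 'don', "'", 't', 'you', '?']
# TreebankWordTokenizer: ['why', 'do', "n't", 'you', '?']
```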

Jacob
  • Thanks. I used `nltk.WhitespaceTokenizer().tokenize("why don't you?")` following your suggestion on that webpage, and got `['why', "don't", 'you?']`. I'll update my posted answer below if I find a way to tokenize the punctuation too with this tokenizer. – alchemy Oct 26 '22 at 19:58

I've worked with NLTK on this kind of project before. When I did, I found that contractions were useful to consider.

However, I did not write a custom tokenizer; I simply handled contractions after POS tagging.
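
The exact post-tagging handling isn't shown here, but as an illustration of why this is workable: the standard tagger assigns the split pieces sensible tags of their own, so they are easy to find and normalize afterwards (requires the tokenizer and tagger data via `nltk.download`):

```python
import nltk

# Treebank-style tokens; "n't" becomes its own token:
tokens = nltk.word_tokenize("why don't you?")

# The tagger treats the fragments as ordinary tokens, e.g. "n't" as an
# adverb (RB), so contractions can be merged or normalized at this stage.
print(nltk.pos_tag(tokens))
# e.g. [('why', 'WRB'), ('do', 'VBP'), ("n't", 'RB'), ('you', 'PRP'), ('?', '.')]
```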

I suspect this is not the answer you are looking for, but I hope it helps somewhat.

inspectorG4dget

Because the number of contractions is fairly small, one way to do it is to search for each contraction and replace it with its full equivalent (e.g. "don't" → "do not"), and then feed the expanded sentences into `wordpunct_tokenize`.
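
A minimal sketch of that approach; the contraction map below is an illustrative sample, not a complete list:

```python
import re

from nltk.tokenize import wordpunct_tokenize

# Small illustrative map; a real one would cover many more forms
# (and handle capitalization more carefully).
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its full equivalent."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(wordpunct_tokenize(expand_contractions("why don't you?")))
# ['why', 'do', 'not', 'you', '?']
```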

Neodawn

Use `WhitespaceTokenizer`, which splits on whitespace only and therefore leaves contractions intact (though punctuation stays attached to the neighboring word):

```python
import nltk

print(nltk.WhitespaceTokenizer().tokenize("why don't you?"))
# ['why', "don't", 'you?']
```

alchemy