66

I am using nltk and want to create my own custom texts, just like the default ones in nltk.book. So far the only approach I have found is to build the token list by hand:

my_text = ['This', 'is', 'my', 'text']

I'd like a way to input my text as a plain string instead:

my_text = "This is my text, this is a nice way to input text."

Which method, Python's or NLTK's, allows me to do this? And more importantly, how can I discard punctuation symbols?

diegoaguilar

2 Answers

169

This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
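Since word_tokenize splits punctuation into tokens of its own, dismissing punctuation afterwards is just a filter over the token list. A minimal sketch, using the token list above as a literal so it runs without downloading any NLTK data:

```python
import string

# Tokens as produced by nltk.word_tokenize in the example above
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

# Keep only tokens that contain at least one non-punctuation character,
# so "o'clock" and "n't" survive but '.' is dropped
words = [t for t in tokens if not all(c in string.punctuation for c in t)]
print(words)
```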
Pavel Anossov
    the problem is that it doesn't split /. If you have "today and/or tomorrow are good days", it gives "and/or" as a single token by default. – thang Oct 21 '16 at 18:05
    how do we convert "n't" to "not"? – Omayr Apr 12 '17 at 13:50
  • @Omayr, I would use regular expressions to convert "n't" to "not". For example: `re.sub("'t", 'ot', "n't, doesn't, can't, don't")` – Samuel Nde Aug 08 '18 at 18:34
  • I was using word_tokenize in Python2, but in Python3 I would like to have a list of bytes, not strings. Is it possible? – Tedo Vrbanec Apr 02 '19 at 16:00
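Following up on the comments above, converting fragments like "n't" to "not" can be done as a post-processing pass over the token list. A sketch with a plain replacement table (the mapping below is my own illustrative choice, not an NLTK feature):

```python
# Hypothetical mapping for contraction fragments produced by word_tokenize
REWRITES = {"n't": "not", "'re": "are", "'ve": "have"}

tokens = ['Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

# Replace a token if it is in the table, otherwise keep it unchanged
normalized = [REWRITES.get(t, t) for t in tokens]
print(normalized)
```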
-7

As @PavelAnossov answered, the canonical approach is to use the word_tokenize function from nltk:

from nltk import word_tokenize
sent = "This is my text, this is a nice way to input text."
word_tokenize(sent)

If your sentence is truly simple enough, you can use the string.punctuation set to remove punctuation and then split on whitespace:

import string
x = "This is my text, this is a nice way to input text."
# Drop every punctuation character, then split on spaces
y = "".join([i for i in x if i not in string.punctuation]).split(" ")
print(y)
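The same stripping can also be done with str.translate, which avoids the per-character loop; a small sketch in Python 3:

```python
import string

x = "This is my text, this is a nice way to input text."

# Build a translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
words = x.translate(table).split()
print(words)
```

Note that, unlike word_tokenize, this approach also mangles internal punctuation such as the apostrophe in "didn't".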
alvas
    @pavel's answer will resolve problems like `didn't` -> `did` + `n't` – alvas Jun 17 '13 at 07:03
  • What are the issues with `word_tokenize`? Seeing there are so many downvotes, I want to make sure I didn't miss something. – flow2k Jul 29 '19 at 04:47
    I didn't downvote, but I'm guessing your answer is essentially a copy of Pavel's answer. Maybe a comment on his answer would have been more appropriate. – Anoyz Dec 11 '19 at 12:47