66

I am using nltk and want to create my own custom texts, just like the default ones in nltk.book. So far the only approach I have found is to build the token list by hand:

my_text = ['This', 'is', 'my', 'text']

I'd like a way to input my text as a plain string instead:

my_text = "This is my text, this is a nice way to input text."

Which method, Python's or NLTK's, allows me to do this? And more importantly, how can I discard punctuation symbols?

diegoaguilar

2 Answers

169

This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
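Since word_tokenize splits punctuation into tokens of its own, dismissing punctuation afterwards is just a filter over the token list. A minimal sketch, using the token list above as a literal so it runs without downloading any NLTK data:

```python
import string

# Tokens as produced by nltk.word_tokenize in the example above
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

# Keep only tokens that contain at least one non-punctuation character,
# so "o'clock" and "n't" survive but '.' is dropped
words = [t for t in tokens if not all(c in string.punctuation for c in t)]
print(words)
```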
Pavel Anossov
    the problem is that it doesn't split /. If you have "today and/or tomorrow are good days", it gives "and/or" as a single token by default. – thang Oct 21 '16 at 18:05
    how do we convert "n't" to "not"? – Omayr Apr 12 '17 at 13:50
  • @Omayr, I would use regular expressions to convert "n't" to "not". For example: `re.sub("'t", 'ot', "n't, doesn't, can't, don't")` – Samuel Nde Aug 08 '18 at 18:34
  • I was using word_tokenize in Python2, but in Python3 I would like to have a list of bytes, not strings. Is it possible? – Tedo Vrbanec Apr 02 '19 at 16:00
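Following up on the comments above, converting fragments like "n't" to "not" can be done as a post-processing pass over the token list. A sketch with a plain replacement table (the mapping below is my own illustrative choice, not an NLTK feature):

```python
# Hypothetical mapping for contraction fragments produced by word_tokenize
REWRITES = {"n't": "not", "'re": "are", "'ve": "have"}

tokens = ['Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

# Replace a token if it is in the table, otherwise keep it unchanged
normalized = [REWRITES.get(t, t) for t in tokens]
print(normalized)
```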
-7

As @PavelAnossov answered, the canonical approach is to use the word_tokenize function from nltk:

from nltk import word_tokenize
sent = "This is my text, this is a nice way to input text."
word_tokenize(sent)

If your sentence is truly simple enough, you can use the string.punctuation set to remove punctuation and then split on whitespace:

import string
x = "This is my text, this is a nice way to input text."
# Drop every punctuation character, then split on spaces
y = "".join([i for i in x if i not in string.punctuation]).split(" ")
print(y)
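The same stripping can also be done with str.translate, which avoids the per-character loop; a small sketch in Python 3:

```python
import string

x = "This is my text, this is a nice way to input text."

# Build a translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
words = x.translate(table).split()
print(words)
```

Note that, unlike word_tokenize, this approach also mangles internal punctuation such as the apostrophe in "didn't".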
alvas
    @pavel's answer will resolve problems like `didn't` -> `did` + `n't` – alvas Jun 17 '13 at 07:03
  • What are the issues with `word_tokenize`? Seeing there are so many downvotes, I want to make sure I didn't miss something. – flow2k Jul 29 '19 at 04:47
    I didn't downvote, but I'm guessing your answer is essentially a copy of Pavel's answer. Maybe a comment on his answer would have been more appropriate. – Anoyz Dec 11 '19 at 12:47