
I am trying to input an entire paragraph into my word processor to be split into sentences first and then into words.

I tried the following code, but it does not work:

    #text is the paragraph input
    sent_text = sent_tokenize(text)
    tokenized_text = word_tokenize(sent_text.split)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)

However, this is not working and gives me errors. How do I tokenize a paragraph into sentences and then into words?

An example paragraph:

This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.

**WARNING:** This is just random text from the internet; I do not own the above content.

Nikhil Raghavendra
  • Can you post a sample of `text`? – alvas Jun 03 '16 at 04:24
  • @alvas it's just any random paragraph. – Nikhil Raghavendra Jun 03 '16 at 04:34
  • Show the input, because the code will be different depending on the encoding, shape, input differences. – alvas Jun 03 '16 at 05:39
  • @alvas here is the input, so what kind of encoding, shape and input differences should be included? – Nikhil Raghavendra Jun 04 '16 at 05:58
  • Show an actual sample input... If it's just plain English text (not social media, e.g. Twitter), you can easily do `[pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]`, and using Python 3 should resolve most issues with utf-8. But if your input is a different encoding/format, you'll find further problems later. – alvas Jun 04 '16 at 08:01
  • @alvas I will just be copying and pasting paragraph after paragraph, but there are still some encoding issues. If I copy and paste the example text, it gives me encoding errors. Why? – Nikhil Raghavendra Jun 04 '16 at 12:58
  • Upload a copy/sample of your file on dropbox or something and share it. Perhaps we may/may not be able to help. – alvas Jun 04 '16 at 13:08
  • Also, which OS are you using? If you're using linux, what is the output of `locale` on the command line? – alvas Jun 04 '16 at 13:09
  • @alvas, I am using Windows 10 and Anaconda, with the Spyder IDE to be precise. – Nikhil Raghavendra Jun 04 '16 at 14:04

3 Answers


You probably intended to loop over sent_text:

    import nltk

    sent_text = nltk.sent_tokenize(text)  # this gives us a list of sentences
    # now loop over each sentence and tokenize it separately
    for sentence in sent_text:
        tokenized_text = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized_text)
        print(tagged)
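
If running this raises a `LookupError` about missing NLTK data, the tokenizer and tagger models probably have not been downloaded yet. A minimal one-time setup sketch, using the standard NLTK resource names for the Punkt sentence tokenizer and the default POS tagger:

    import nltk

    # One-time downloads; skip if the resources are already installed
    nltk.download('punkt')                       # models used by sent_tokenize / word_tokenize
    nltk.download('averaged_perceptron_tagger')  # model used by pos_tag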
slider
  • `reload(sys); sys.setdefaultencoding('utf8')` is [toxic code](http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script). And if it's `python3`, it's rather redundant. The printing itself depends on the locale set on the user's machine. – alvas Jun 03 '16 at 05:40
  • @Nikhil, do not do the `setdefaultencoding` hack. Ask a new question about the step that's giving you encoding problems, and you'll learn how to specify the file encoding when processing unicode. – alexis Jun 03 '16 at 10:44
  • [This](http://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8) explains why it's a very bad idea. – alexis Jun 03 '16 at 10:56
  • Thanks for the warning :-) – Nikhil Raghavendra Jun 04 '16 at 05:53
  • Who knows how to save the positions of the tokens? – Vladimir Stazhilov Feb 09 '18 at 10:35

Here's a shorter version. It gives you a data structure with each individual sentence, and each token within the sentence. I prefer the TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your words until after this step, as it may hurt the accuracy of detecting sentence boundaries in messy text.

    from nltk.tokenize import TweetTokenizer, sent_tokenize

    # input_text is the paragraph input
    tokenizer_words = TweetTokenizer()
    tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
    print(tokens_sentences)

Here's what the output looks like, which I cleaned up so the structure stands out:

    [
     ['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'],
     ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
     ['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
     ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
    ]
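
If you also want the part-of-speech tags the question asks for, a small sketch (not part of the original answer) is to run `nltk.pos_tag` over each tokenized sentence in the structure above:

    import nltk
    from nltk.tokenize import TweetTokenizer, sent_tokenize

    # Sketch: POS-tag each tokenized sentence; input_text is the paragraph as above
    tokenizer_words = TweetTokenizer()
    tagged_sentences = [nltk.pos_tag(tokenizer_words.tokenize(t))
                        for t in sent_tokenize(input_text)]
    print(tagged_sentences)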
Brian Cugelman
    import nltk

    textsample = "This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."

    sentences = nltk.sent_tokenize(textsample)  # split the paragraph into sentences
    words = nltk.word_tokenize(textsample)      # tokenize the whole paragraph into words
    sentences                                   # displayed below (wrap in print() outside a REPL)
    [w for w in words if w.isalpha()]           # keep only purely alphabetic tokens

The last line above ensures that only words, and no special characters, appear in the output. The sentence output is shown below:

    ['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.',
     "He sank down in despair at the child's feet.",
     'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.',
     'At the same time with his ears and his eyes he offered a small prayer to the child.']

The word output, after removing special characters, is shown below:

    ['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the',
     'little', 'dog', 'and', 'wounded', 'him', 'to', 'the', 'heart', 'He',
     'sank', 'down', 'in', 'despair', 'at', 'the', 'child', 'feet', 'When',
     'the', 'blow', 'was', 'repeated', 'together', 'with', 'an', 'admonition',
     'in', 'childish', 'sentences', 'he', 'turned', 'over', 'upon', 'his',
     'back', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner',
     'At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child']
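
If you need the words grouped per sentence rather than in one flat list (as the question asks), a small sketch building on this answer's variables could look like the following; it simply nests the word tokenization inside the sentence loop:

    # Sketch: tokenize each sentence separately, reusing `sentences` from above,
    # and keep only purely alphabetic tokens as in the last line of the answer
    words_per_sentence = [[w for w in nltk.word_tokenize(s) if w.isalpha()]
                          for s in sentences]
    print(words_per_sentence)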
Sripathi