
I am learning Natural Language Processing using NLTK. I came across some code using PunktSentenceTokenizer, whose actual use I cannot understand. The code is given below:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A

tokenized = custom_sent_tokenizer.tokenize(sample_text)   #B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))


process_content()

So, why do we use PunktSentenceTokenizer? And what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why do we need two data sets to get the part-of-speech tagging?

The lines marked A and B are the ones I am not able to understand.

PS: I did try to look in the NLTK book, but I could not understand the real use of PunktSentenceTokenizer.

arqam

4 Answers


PunktSentenceTokenizer is the class behind the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79

Given a paragraph with multiple sentences, e.g.:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print sent
...     print '--------'
... 
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------

sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle (see the sketch after the listing below for loading it directly). You can also specify other languages; the languages with pre-trained models available in NLTK are:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README
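
If you want to check that this default tokenizer really is a Punkt model, you can load that pickle yourself. A minimal sketch (shown with Python 3 output; the sample sentence is made up for illustration):

>>> import nltk
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> type(tokenizer)
<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
>>> tokenizer.tokenize("This is one sentence. This is another sentence.")
['This is one sentence.', 'This is another sentence.']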

Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print sent
...     print '---------'
... 
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------

To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the question "training data format for nltk punkt".
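
As a rough sketch of what that training looks like (assuming your own plain-text corpus is in a file called my_corpus.txt, which is just a placeholder name; the final sentence is made up for illustration), you can use PunktTrainer and build a tokenizer from the learned parameters:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# "my_corpus.txt" is a placeholder; use any large plain-text file in your domain/language
with open("my_corpus.txt", encoding="utf-8") as f:
    raw_text = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True       # collect more collocation/abbreviation evidence
trainer.train(raw_text, finalize=False)  # can be called repeatedly on more text
trainer.finalize_training()

# build a tokenizer from the learned parameters and use it
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Dr. Smith arrived at 5 p.m. He was not late."))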

alvas
  • Thanks for the answer. I don't know why anyone downvoted this. Anyway, these are pre-trained models. But can you please tell me how the behaviour of `PunktSentenceTokenizer` will change if I use my own training set? I mean, what actually happens during training? – arqam Feb 09 '16 at 04:57
  • @Arquam In training, the model parameters are set according to what is observed in the training data. – Joachim Wagner Sep 17 '18 at 06:39

PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.

So if you initialize the tokenizer without any arguments, it will default to the pre-trained version:

In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']

You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, meaning you just train it with regular text.

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

For most cases, it is totally fine to use the pre-trained version, so you can simply initialize the tokenizer without providing any arguments.

So "what all this has to do with POS tagging"? The NLTK POS tagger works with tokenized sentences, so you need to break your text into sentences and word tokens before you can POS tag.

See NLTK's documentation.

[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"

CentAu
  • Note that `PunktSentenceTokenizer` doesn't train the tokenizer but loads the pre-trained models. To train a new tokenizer, use `PunktTrainer` https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py#L607 – alvas Feb 08 '16 at 21:43
  • @alvas It does train the tokenizer. If the train_text argument is not None, it calls the tokenizer's train method: `if train_text: self.train(train_text, verbose, finalize=True)` – CentAu Feb 08 '16 at 21:45
  • Yep, I was going to say that, but was too slow to type. Or use `PunktSentenceTokenizer.train()` – alvas Feb 08 '16 at 21:46
  • BTW, it's not me who downvoted your answer. It might be some troll. – alvas Feb 08 '16 at 21:47
  • Same here! Not clear why two correct answers are downvoted. – CentAu Feb 08 '16 at 21:50
  • Yup, this is relevant and to the point. Two things should be added: (a) the Punkt tokenizer uses an *unsupervised* algorithm, meaning you just train it with regular text. (b) What all this has to do with POS tagging, as the OP asks: The NLTK POS tagger works with tokenized sentences, so you need to break your text into sentences and word tokens before you can POS tag. – alexis Feb 08 '16 at 22:07
  • @alexis Thanks for the note. Added them to the answer. – CentAu Feb 08 '16 at 22:17
  • @CentAu Lol, looks like someone was trolling with downvotes. Thanks for the answer. So can you please tell me why I should use `PunktSentenceTokenizer` and not tokenize directly with, say, `word_tokenize` to get POS tagging? – arqam Feb 09 '16 at 04:59
  • @Arqam The NLTK POS tagger works more accurately with tokenized sentences, so you need to break your text into sentences before you can POS tag. If your text is only a single sentence, you can directly use the `word_tokenize` function. Otherwise, first split the sentences (using either the `PunktSentenceTokenizer` or simply `sent_tokenize` function of the nltk) and then apply `word_tokenize` on each of the sentences. – CentAu Feb 09 '16 at 18:30

You can refer to the link below to get more insight into the usage of PunktSentenceTokenizer. It explains why PunktSentenceTokenizer is used instead of sent_tokenize() in a case like yours.

http://nlpforhackers.io/splitting-text-into-sentences/

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")

def process_content(corpus):
    # sentence-split the corpus with a default PunktSentenceTokenizer (no training text passed)
    tokenized = PunktSentenceTokenizer().tokenize(corpus)

    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content(train_text)

Even without training it on other text data, it works the same, as it is pre-trained.

ashirwad