
I would like to use NLTK's Punkt to split text into sentences. There is no pre-trained model for my data, so I want to train one separately, but I am not sure whether the training data format I am using is correct.

My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light on the training data format.

What is the correct training data format for NLTK Punkt sentence tokenizer?

Asterisk
  • As I learned, `PunktTrainer` can build a list of potential abbreviations without supervision, which helps sentence tokenization. But in my opinion it still does not work very well for English abbreviations, and I'm not sure how much it would help in other languages. What I see in the source code is the use of language-specific punctuation to tokenize the words, and of newlines and periods to find sentence endings and abbreviations. I guess you need well-formatted training sentences, one per line. http://nltk.org/_modules/nltk/tokenize/punkt.html#PunktTrainer – Mehdi Jan 16 '14 at 12:33
  • @Mehdi, you don't need one training sentence per line. If you've already done that, you can simply extract features and train a supervised classifier. The magic of Punkt is that it does this in an unsupervised manner, without initially specifying where the sentence boundaries are. If you want to retrain Punkt and manually specify abbreviations, see http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer – alvas Jan 17 '14 at 04:41
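(For reference, a minimal sketch of the "manually specify abbreviations" route mentioned in the comment above, using NLTK's PunktParameters; the abbreviation list here is purely illustrative:)

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Seed the tokenizer with known abbreviations (lowercase, no trailing period).
params = PunktParameters()
params.abbrev_types = set(['dr', 'mr', 'mrs', 'prof', 'vs', 'inc'])
tokenizer = PunktSentenceTokenizer(params)
print(tokenizer.tokenize("Dr. Smith met Mr. Jones. They talked for an hour."))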

1 Answer


Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detection. The authors' last names are pretty cool too: Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent).

To train a new model, simply use:

import codecs
import pickle

import nltk.tokenize.punkt

# Read the raw (unannotated) training text.
text = codecs.open("someplain.txt", "r", "utf8").read()

# Train an unsupervised Punkt sentence boundary detector on it.
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
tokenizer.train(text)

# Pickle the trained tokenizer for reuse.
with open("someplain.pk", "wb") as out:
    pickle.dump(tokenizer, out)
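Once the model is pickled, you can load it back and tokenize with it; a minimal sketch, assuming the filenames from the snippet above:

import pickle

# Load the tokenizer that was pickled above and split some text with it.
with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)

print(tokenizer.tokenize("This is a sentence. Here is another one."))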

To achieve higher precision, and to be able to stop training at any time and still save a proper pickle for your tokenizer, look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :

import codecs
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

def train_punktsent(trainfile, modelfile):
  """ Trains an unsupervised NLTK punkt sentence tokenizer. """
  punkt = PunktTrainer()
  try:
    # Feed the raw training text in without finalizing, so training can
    # be interrupted and still produce a usable model.
    with codecs.open(trainfile, 'r', 'utf8') as fin:
      punkt.train(fin.read(), finalize=False, verbose=False)
  except KeyboardInterrupt:
    print('KeyboardInterrupt: Stopping the reading of the dump early!')
  ## HACK: Adds abbreviations from rb_tokenizer (one abbreviation per line
  ## in abbrev.lex), joined into a single pseudo-sentence.
  abbrv_sent = " ".join([i.strip() for i in
                         codecs.open('abbrev.lex', 'r', 'utf8').readlines()])
  abbrv_sent = "Start " + abbrv_sent + " End."
  punkt.train(abbrv_sent, finalize=False, verbose=False)
  # Finalize and output the trained model.
  punkt.finalize_training(verbose=True)
  model = PunktSentenceTokenizer(punkt.get_params())
  with open(modelfile, mode='wb') as fout:
    pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
  return model
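For example, a sketch of calling it (the training file and model filenames here are just placeholders):

# Train on a raw German corpus and use the returned tokenizer right away.
model = train_punktsent('german_corpus.txt', 'german_punkt.pk')
for sent in model.tokenize(u"Das ist ein Satz. Hier ist noch einer."):
    print(sent)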

However, do note that period detection is very sensitive to the Latin full stop, question mark, and exclamation mark. If you're going to train a Punkt tokenizer for a language that doesn't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of Punkt, edit the sent_end_chars variable.
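A rough sketch of that hack without editing the NLTK source, by subclassing PunktLanguageVars (the class name and the Devanagari danda here are just examples):

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer, PunktTrainer

class HindiLanguageVars(PunktLanguageVars):
    # Also treat the Devanagari danda as a sentence-ending character.
    sent_end_chars = ('.', '?', '!', u'\u0964')

trainer = PunktTrainer(lang_vars=HindiLanguageVars())
# trainer.train(raw_text, finalize=True), then:
# tokenizer = PunktSentenceTokenizer(trainer.get_params(), lang_vars=HindiLanguageVars())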

There are pre-trained models available for languages other than the 'default' English tokenizer used by nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt

Edited

Note that the pre-trained models are currently not available at the link above, because the nltk_data GitHub repo listed there has been removed.
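The models that ship with the standard NLTK data package can still be loaded directly, e.g. (assuming you've run nltk.download('punkt') beforehand):

import nltk.data

# Load the pre-trained German Punkt model bundled with nltk_data.
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_tokenizer.tokenize(u"Das ist ein Satz. Hier ist noch einer."))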

alvas
  • The link to the paper is broken. Here's a new one: http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485 – BlackBear Mar 23 '16 at 08:40