
I've noticed that NLTK's sent_tokenize makes mistakes with some dates. Is there any way to adjust it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14, 
 or with other in-house events, specials, or happy hour.']

But it should result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14, 
  or with other in-house events, specials, or happy hour.']

as the period after 'january 1' is a legitimate sentence termination character.

user2694306

1 Answer


Firstly, the sent_tokenize function uses the Punkt tokenizer, which was trained to tokenize well-formed English sentences. So including the correct capitalization would resolve your problem:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now, let's dig deeper. The Punkt tokenizer implements an algorithm by Kiss and Strunk (2005); see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation.

This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

So in the case of sent_tokenize, I'm quite sure it's trained on a well-formed English corpus, hence capitalization after a full stop is a strong indication of a sentence boundary. The full stop itself might not be, since we have things like i.e. and e.g.

And in some cases the corpus might have things like `01. put pasta in pot \n02. fill the pot with water`. With such sentences/documents in the training data, it is very likely that the algorithm thinks that a full stop following a non-capitalized word is not a sentence boundary.
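As a rough sanity check, you can load the pretrained English Punkt model (the pickle that sent_tokenize loads for English) and peek at what it has learned. Note that `_params` is a private attribute, so this is only an inspection sketch, not a stable API:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sorted(punkt._params.abbrev_types)[:10]   # tokens the model learned to treat as abbreviations
>>> sorted(punkt._params.sent_starters)[:10]  # lowercased tokens it often saw starting sentences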

So to resolve the problem, I suggest the following:

  1. Manually segment 10-20% of your sentences and then retrain a corpus-specific tokenizer (a rough sketch follows this list)
  2. Convert your corpus into well-formed orthography before using sent_tokenize (also sketched below)
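
For option 1, here is a minimal sketch of retraining with PunktTrainer; the filename `my_corpus.txt` is a placeholder for your manually segmented plain text:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> train_text = open('my_corpus.txt').read()  # your manually segmented plain text
>>> trainer = PunktTrainer()
>>> trainer.train(train_text)
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())
>>> tokenizer.tokenize('valid any day after january 1. not valid on federal holidays.')

For option 2, the simplest form is to run sent_tokenize before any lowercasing step in your pipeline, then lowercase each sentence afterwards if you need to:

>>> from nltk import sent_tokenize
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> [sent.lower() for sent in sent_tokenize(s2)]
['valid any day after january 1.', 'not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']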

See also: training data format for nltk punkt

alvas
  • Great answer. I'll try to move the tokenizer to before the text is converted to lowercase. – user2694306 Dec 02 '14 at 09:16
  • You should not lowercase your corpus unnecessarily. It might decrease sparsity in model training and improve retrieval rates in IR, but it results in noisy models and almost always results in bad preprocessing, because the tokenizing, tagging, and parsing models were built on well-formed data. – alvas Dec 02 '14 at 09:24
  • 1
    E.g. `The Sleeping Dog is a good game`, in this case `Sleeping Dog` is a named entity and if you lowercased it before POS tagging, i'm pretty sure the POS tagger will say it's a "sleeping (adjective/adjectival verb) dog (noun)" instead of `sleep (NNP) dog (NNP)` – alvas Dec 02 '14 at 09:26
  • 1
    Have fun with preprocessing, it often leads to amazingly different results in NLP depending on the order and models or parameters you used when preprocessing... – alvas Dec 02 '14 at 09:27