Firstly, the sent_tokenize function uses the Punkt tokenizer, which was trained to tokenize well-formed English sentences. So using the correct capitalization would have resolved your problem:
>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
Now, let's dig deeper. The Punkt tokenizer is an implementation of the algorithm by Kiss and Strunk (2005); see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation. From its documentation:
This tokenizer divides a text into a list of sentences, by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on
a large collection of plaintext in the target language before it can
be used.
So in the case of sent_tokenize, I'm quite sure it's trained on a well-formed English corpus, hence capitalization after a full stop is a strong indication of a sentence boundary, while the full stop by itself might not be, since we have abbreviations like i.e. and e.g.
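If you're curious what the pretrained model has actually learned, you can peek at its parameters. This is only a debugging aid: it assumes the classic English Punkt pickle is installed (via nltk.download('punkt')), and _params is an internal attribute, so the exact contents will vary with your NLTK data version:
>>> import nltk
>>> punkt_tok = nltk.data.load('tokenizers/punkt/english.pickle')
>>> # abbreviation types are stored lowercased, without the trailing period
>>> sorted(punkt_tok._params.abbrev_types)[:10]
>>> # words the model learned frequently start a sentence
>>> sorted(punkt_tok._params.sent_starters)[:10]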
And in some cases the corpus might contain things like 01. put pasta in pot \n02. fill the pot with water. With such sentences/documents in the training data, it is very likely that the algorithm learns that a full stop followed by a non-capitalized word is not a sentence boundary.
So to resolve the problem, I suggest the following:
- Manually segment 10-20% of your sentences and then retrain a corpus-specific tokenizer (see the sketch after this list)
- Convert your corpus into well-formed orthography before using sent_tokenize
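For the first option, a minimal retraining sketch could look like this. It assumes your raw text sits in my_corpus.txt (a hypothetical filename); since Punkt training is unsupervised, PunktTrainer learns abbreviations, collocations, and sentence starters from that raw text, and the 10-20% you segmented by hand is best used to check the result:
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> raw_text = open('my_corpus.txt').read()   # your own corpus, in its original orthography
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True        # learn more aggressively from a small corpus
>>> trainer.train(raw_text)
>>> custom_tok = PunktSentenceTokenizer(trainer.get_params())
>>> custom_tok.tokenize('valid any day after january 1. not valid on federal holidays.')
Whether the retrained model splits after "january 1." depends on what it sees in your corpus, so compare its output against your manually segmented sentences before trusting it.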
See also: training data format for nltk punkt