There is an example in the nltk.org book (chapter 6) where they use a Naive Bayes algorithm to classify a punctuation symbol as finishing a sentence or not finishing one...
This is what they do: first they take a corpus and use the .sents() method to get the sentences, and from them they build an index of where the punctuation symbols that separate the sentences (the boundaries) are.
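As far as I understand, the setup looks roughly like this (a sketch based on the book, which uses the treebank_raw corpus; the last token position of every sentence is recorded as a boundary):

import nltk

sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)         # merge all sentences into one flat token list
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the sentence-final token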
Then they "tokenize" the text (convert it to list of words and punctuation symbols) and apply the following algorithm/function to each token so that they get a list of features which are returned in a dictionary:
def punct_features(tokens, i):
    return {'nextWordCapitalized': tokens[i+1][0].isupper(),
            'prevWord': tokens[i-1].lower(),
            'punct': tokens[i],
            'prevWordis1Char': len(tokens[i-1]) == 1}
These features will be used by the ML algorithm to classify the punctuation symbol as finishing a sentence or not (i.e., as a boundary token).
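To make this concrete, calling the function on a small made-up token list (my own illustration, not from the book) gives:

tokens = ['Mr', '.', 'Smith', 'arrived', 'on', 'Nov', '.', '29', '.', 'He', 'left']
print(punct_features(tokens, 1))
# {'nextWordCapitalized': True, 'prevWord': 'mr', 'punct': '.', 'prevWordis1Char': False}
print(punct_features(tokens, 6))
# {'nextWordCapitalized': False, 'prevWord': 'nov', 'punct': '.', 'prevWordis1Char': False}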
With this function and the 'boundaries' index, they select all the punctuation tokens, each with its features, and label each one as a True boundary or a False one, thus creating a list of labeled feature sets:
featuresets1 = [(punct_features(tokens, i), (i in boundaries))
                for i in range(1, len(tokens)-1)
                if tokens[i] in '.?!;']
print(featuresets1[:4])
This is an example of the output we could get when printing the first four feature sets:
[({'nextWordCapitalized': False, 'prevWord': 'nov', 'punct': '.', 'prevWordis1Char': False}, False),
({'nextWordCapitalized': True, 'prevWord': '29', 'punct': '.', 'prevWordis1Char': False}, True),
({'nextWordCapitalized': True, 'prevWord': 'mr', 'punct': '.', 'prevWordis1Char': False}, False),
({'nextWordCapitalized': True, 'prevWord': 'n', 'punct': '.', 'prevWordis1Char': True}, False)]
With this, they train and evaluate the punctuation classifier:
size = int(len(featuresets1) * 0.1)
train_set, test_set = featuresets1[size:], featuresets1[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
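For what it's worth, NaiveBayesClassifier can also report which features it weights most heavily (this is standard NLTK API, not something specific to this example):

# Inspect the features the Naive Bayes model relies on most
classifier.show_most_informative_features(5)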
Now, (1) what would such an ML algorithm improve, and how? I can't grasp how it could do better than the first simple algorithm that just checks whether the token after the punctuation symbol is uppercase and the previous one is lowercase. Indeed, that algorithm is the one taken to validate that a symbol is a boundary...! And if it doesn't improve on it, what could it possibly be useful for?
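For concreteness, by the "first simple algorithm" I mean something like this (my own sketch, not code from the book):

def simple_rule(tokens, i):
    # Call the punctuation token at position i a sentence boundary whenever
    # the next token starts with an uppercase letter and the previous token is lowercase.
    return tokens[i+1][0].isupper() and tokens[i-1].islower()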
And related to this: (2) is either of these two algorithms how nltk really separates sentences? I mean, especially if the best one is the first simple one, does nltk understand a sentence to be just the text between two punctuation symbols, each followed by a word whose first character is uppercase and preceded by a lowercase word? Is this what the .sents() method does? Notice that this is far from how Linguistics, or better said the Oxford dictionary, defines a sentence:
"A set of words that is complete in itself, typically containing a subject and predicate, conveying a statement, question, exclamation, or command, and consisting of a main clause and sometimes one or more subordinate clauses."
Or (3) are the raw corpus texts like treebank or brown already divided into sentences manually? If so, what is the criterion used to divide them?