There is an example in the nltk.org book (chapter 6) where they use a Naive Bayes algorithm to classify a punctuation symbol as finishing a sentence or not finishing one...
This is what they do: first they take a corpus and use the .sents() method to get the sentences, and from them they build an index of where the punctuation symbols that separate the sentences (the boundaries) are.
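As far as I understand, the setup looks roughly like this (a sketch based on the book, which uses the treebank_raw corpus; the last token position of every sentence is recorded as a boundary):

import nltk

sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)         # merge all sentences into one flat token list
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the sentence-final token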
Then they "tokenize" the text (convert it to list of words and punctuation symbols) and apply the following algorithm/function to each token so that they get a list of features which are returned in a dictionary:
def punct_features(tokens, i):
    return {'nextWordCapitalized': tokens[i+1][0].isupper(),
            'prevWord': tokens[i-1].lower(),
            'punct': tokens[i],
            'prevWordis1Char': len(tokens[i-1]) == 1}
These features will be used by the ML algorithm to classify the punctuation symbol as finishing a sentence or not (i.e., as a boundary token).
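To make this concrete, calling the function on a small made-up token list (my own illustration, not from the book) gives:

tokens = ['Mr', '.', 'Smith', 'arrived', 'on', 'Nov', '.', '29', '.', 'He', 'left']
print(punct_features(tokens, 1))
# {'nextWordCapitalized': True, 'prevWord': 'mr', 'punct': '.', 'prevWordis1Char': False}
print(punct_features(tokens, 6))
# {'nextWordCapitalized': False, 'prevWord': 'nov', 'punct': '.', 'prevWordis1Char': False}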
With this function and the 'boundaries' index, they select all the punctuation tokens, each with its features, and label each one as a True boundary or a False one, thus creating a list of labeled feature sets:
featuresets1 = [(punct_features(tokens, i), (i in boundaries))
                for i in range(1, len(tokens)-1)
                if tokens[i] in '.?!;']
print(featuresets1[:4])
This is an example of the output we could get when printing the first four feature sets:
[({'nextWordCapitalized': False, 'prevWord': 'nov', 'punct': '.', 'prevWordis1Char': False}, False),
({'nextWordCapitalized': True, 'prevWord': '29', 'punct': '.', 'prevWordis1Char': False}, True),
({'nextWordCapitalized': True, 'prevWord': 'mr', 'punct': '.', 'prevWordis1Char': False}, False),
({'nextWordCapitalized': True, 'prevWord': 'n', 'punct': '.', 'prevWordis1Char': True}, False)]
With this, they train and evaluate the punctuation classifier:
size = int(len(featuresets1) * 0.1)
train_set, test_set = featuresets1[size:], featuresets1[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
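For what it's worth, NaiveBayesClassifier can also report which features it weights most heavily (this is standard NLTK API, not something specific to this example):

# Inspect the features the Naive Bayes model relies on most
classifier.show_most_informative_features(5)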
Now, (1) what would such an ML algorithm improve, and how? I can't grasp how it could do better than the first simple algorithm that just checks whether the token after the punctuation symbol is uppercase and the previous one is lowercase. Indeed, that algorithm is the one taken to validate that a symbol is a boundary...! And if it doesn't improve on it, what could it possibly be useful for?
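For concreteness, by the "first simple algorithm" I mean something like this (my own sketch, not code from the book):

def simple_rule(tokens, i):
    # Call the punctuation token at position i a sentence boundary whenever
    # the next token starts with an uppercase letter and the previous token is lowercase.
    return tokens[i+1][0].isupper() and tokens[i-1].islower()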
And related to this: (2) is either of these two algorithms how nltk really separates sentences? I mean, especially if the best one is the first simple one, does nltk understand a sentence to be just the text between two punctuation symbols, each followed by a word whose first character is uppercase and preceded by a lowercase word? Is this what the .sents() method does? Notice that this is far from how Linguistics, or better said the Oxford dictionary, defines a sentence:
"A set of words that is complete in itself, typically containing a subject and predicate, conveying a statement, question, exclamation, or command, and consisting of a main clause and sometimes one or more subordinate clauses."
Or (3) are the raw corpus texts like treebank or brown already divided into sentences manually? If so, what is the criterion used to divide them?