
NLTK's PunktSentenceTokenizer doesn't detect the end of a sentence properly.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types.update(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'rev'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
sentence_splitter.tokenize(u'In that paper, "Has Financial Development Made the World Riskier?", Rajan "argued that disaster might loom." ')

Output:

[u'In that paper, "Has Financial Development Made the World Riskier?"',
 u', Rajan "argued that disaster might loom."']

another one:

sentence_splitter.tokenize(u'Don "Don C." Crowley')

Output:

[u'Don "Don C."', u'Crowley']

Neither input should be split into two sentences. Is there any way to handle this?
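(One pragmatic workaround, not part of Punkt itself, is to post-process the tokenizer's output and merge any "sentence" that begins with a comma or a lowercase letter back into the previous one; the helper name `tokenize_merged` below is made up for illustration.)

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def tokenize_merged(tokenizer, text):
    """Tokenize, then merge fragments that start with a comma or a
    lowercase letter back into the preceding sentence -- these are
    almost always false splits after a quoted '?' or '!'."""
    merged = []
    for sent in tokenizer.tokenize(text):
        if merged and sent and (sent[0] == ',' or sent[0].islower()):
            merged[-1] = merged[-1] + ' ' + sent
        else:
            merged.append(sent)
    return merged

tokenizer = PunktSentenceTokenizer()
print(tokenize_merged(
    tokenizer,
    u'In that paper, "Has Financial Development Made the World Riskier?", '
    u'Rajan "argued that disaster might loom."'))
```

This repairs the first example (the fragment starting with a comma is merged back), but not the second one, since "Crowley" starts with a capital letter and looks like a legitimate sentence start.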

Matt Dixie
  • You may want to train `PunktTokenizer` to improve its accuracy, see https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt – DYZ Aug 02 '17 at 22:29
  • Sentence boundary detection is hard to do without understanding what you read (so, for any computer program). If you train a new model, it may or may not perform as well as the one supplied with NLTK -- be prepared to evaluate both and compare. Unless you have text with strange punctuation/spacing conventions, the best approach is simply to live with imperfect results. How often does this happen, and how much do the mistakes _really_ impact your practical goals? – alexis Aug 03 '17 at 08:32

0 Answers