Python Sentence Split using NLTK

Question

I am working on splitting a paragraph into sentences.

From searching, I found that NLTK mostly works well for splitting sentences, but I ran into one problem.

import nltk

# Load the pre-trained Punkt sentence tokenizer for English
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
summary = 'George Stanley McGovern (July 19, 1922 – October 21, 2012) was an American historian, author, U.S. Representative, U.S. Senator, and the Democratic Party presidential nominee in the 1972 presidential election.'
# tokenize() returns a list of sentence strings
sentences = sent_detector.tokenize(summary)
print(sentences)

The result should be a single sentence. However, the tokenizer returns two, breaking after the second "U.S.":

['George Stanley McGovern (July 19, 1922 \x96 October 21, 2012) was an American historian, author, U.S. Representative, U.S.', 'Senator, and the Democratic Party presidential nominee in the 1972 presidential election.']

Code from D Greenberg seems better than NLTK so far. http://stackoverflow.com/questions/4576077/python-split-text-on-sentences — Ayden Kim, Aug 02 '16 at 18:38
As you said yourself, it mostly works well. You found an example that managed to prove its weakness. I stumble upon examples like this all the time. You can try splitta: https://github.com/lukeorland/splitta — bogs, Aug 03 '16 at 11:03
Sentence boundary detection is _hard_. This is an example of that. I suspect it's the trailing period, which I'm not sure is correct. Is it the U.S.A.? — Athena, Aug 03 '16 at 19:47
Yes, it is the U.S.A. So far the code that I mentioned before works better than NLTK. http://stackoverflow.com/questions/4576077/python-split-text-on-sentences — Ayden Kim, Aug 04 '16 at 20:23
http://morphadorner.northwestern.edu/sentencesplitter/example/ This package could be good too, but it is written in Java. — Ayden Kim, Aug 04 '16 at 20:25
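
One possible workaround for inputs like the one in the question is a small rule-based splitter that refuses to break after a known abbreviation. The sketch below is only an illustration of that idea using the standard library; it is not the code from the linked answer, it does not use NLTK, and the names `ABBREVIATIONS`, `BOUNDARY`, and `split_sentences` are made up for this example. The abbreviation list is an assumption and deliberately short.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it for real text.
ABBREVIATIONS = {"u.s.", "u.s.a.", "mr.", "mrs.", "dr.", "e.g.", "i.e."}

# Candidate boundary: whitespace preceded by '.', '!' or '?'
# and followed by an uppercase ASCII letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Last whitespace-delimited token before the candidate boundary.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            # Treat the period as part of the abbreviation, not a sentence end.
            continue
        sentences.append(text[start:match.start()])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


summary = ('George Stanley McGovern (July 19, 1922 – October 21, 2012) was an '
           'American historian, author, U.S. Representative, U.S. Senator, and '
           'the Democratic Party presidential nominee in the 1972 presidential '
           'election.')
print(len(split_sentences(summary)))  # the summary stays in one piece: prints 1
```

The trade-off is that a sentence which really does end in "U.S." will be merged with the one that follows it, so this is a heuristic rather than a general solution to sentence boundary detection.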
