I am working on splitting a paragraph into sentences.

After some searching, I found that NLTK generally splits sentences well, but I ran into one problem.

import nltk

# Load the pre-trained English Punkt sentence tokenizer.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
summary = 'George Stanley McGovern (July 19, 1922 – October 21, 2012) was an American historian, author, U.S. Representative, U.S. Senator, and the Democratic Party presidential nominee in the 1972 presidential election.'
# Split the paragraph into a list of sentence strings.
sentences = sent_detector.tokenize(summary)
print(sentences)

The result should be just one sentence. However, it returns two sentences.

['George Stanley McGovern (July 19, 1922 \x96 October 21, 2012) was an American historian, author, U.S. Representative, U.S.', 'Senator, and the Democratic Party presidential nominee in the 1972 presidential election.']

Evan Porter
Ayden Kim
  • Code from D Greenberg seems better than NLTK so far: http://stackoverflow.com/questions/4576077/python-split-text-on-sentences – Ayden Kim Aug 02 '16 at 18:38
  • As you said yourself, it mostly works well. You found an example that managed to prove its weakness. I stumble upon examples like this all the time. You can try splitta: https://github.com/lukeorland/splitta – bogs Aug 03 '16 at 11:03
  • Sentence boundary detection is _hard_. This is an example of that. I suspect it's the trailing period, which I'm not sure is correct. Is it the U.S.A.? – Athena Aug 03 '16 at 19:47
  • Yes, it is the U.S.A. So far the code that I mentioned before works better than NLTK. http://stackoverflow.com/questions/4576077/python-split-text-on-sentences – Ayden Kim Aug 04 '16 at 20:23
  • http://morphadorner.northwestern.edu/sentencesplitter/example/ This package could be good too, but it is Java. – Ayden Kim Aug 04 '16 at 20:25
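
For comparison with the regex-based alternatives mentioned in the comments, here is a minimal sketch of a rule-based splitter that refuses to break after a known abbreviation. It is not NLTK's Punkt algorithm and not the code from the linked answer; `split_sentences`, `ABBREVIATIONS`, and `BOUNDARY` are hypothetical names, and the abbreviation list is an assumption that would need extending for real text.

```python
import re

# Hypothetical, deliberately small abbreviation list (an assumption, not from NLTK).
ABBREVIATIONS = {"u.s.", "u.s.a.", "mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

# Candidate boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping those that follow an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Token immediately before the candidate boundary, e.g. "U.S."
        last_token = text[start:match.start()].split()[-1]
        # Ignore opening brackets or quotes glued to the token, then lowercase it.
        last_token = last_token.lstrip("([\"'").lower()
        if last_token in ABBREVIATIONS:
            continue  # treat the period as part of the abbreviation, not a boundary
        sentences.append(text[start:match.start()])
        start = match.end()
    sentences.append(text[start:])
    return sentences


print(split_sentences("It rained. We stayed in."))
```

With this list, the McGovern summary from the question stays a single sentence because both candidate breaks follow "U.S.". The trade-off is that a sentence that really does end in "U.S." would be merged with the next one, which is the same ambiguity that makes this example hard for Punkt.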

0 Answers