
I wish to split text into sentences. Can anyone help me?

I also need to handle abbreviations. However, my plan is to replace these at an earlier stage, e.g. Mr. -> Mister.
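
Something like this is what I have in mind for that earlier replacement step (just a rough sketch with a couple of sample abbreviations; the actual list would be much longer):

ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor"}  # small sample only

def expand_abbreviations(text):
    # Replace each dotted abbreviation before attempting the sentence split.
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return text

print(expand_abbreviations("Mr. Smith met Dr. Jones."))  # Mister Smith met Doctor Jones.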

import re
import unittest

class Sentences:

    def __init__(self, text):
        # Split on a sentence terminator (., ! or ?) followed by whitespace.
        self.sentences = tuple(re.split(r"[.!?]\s", text))

class TestSentences(unittest.TestCase):

    def testFullStop(self):
        self.assertEqual(Sentences("X. X.").sentences, ("X.", "X."))

    def testQuestion(self):
        self.assertEqual(Sentences("X? X?").sentences, ("X?", "X?"))

    def testExclamation(self):
        self.assertEqual(Sentences("X! X!").sentences, ("X!", "X!"))

    def testMixed(self):
        self.assertEqual(Sentences("X! X? X! X.").sentences, ("X!", "X?", "X!", "X."))

if __name__ == "__main__":
    unittest.main()

Thanks, Barry

EDIT: To start with, I would be happy to satisfy the four tests I've included above. This would help me understand better how regexes work. For now I can define a sentence as "X." etc., as defined in my tests.


1 Answer


Sentence segmentation can be a very difficult task, especially when the text contains dotted abbreviations. It may require the use of lists of known abbreviations, or training a classifier to recognize them.
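
For the simple definition in your tests, a plain regex split that keeps the terminator with its sentence (using a lookbehind) is enough; a minimal sketch is below, and it also shows how a dotted abbreviation immediately breaks it:

import re

def naive_split(text):
    # Split on whitespace that follows ., ! or ?; the zero-width lookbehind
    # keeps the terminator attached to the sentence before it.
    return tuple(re.split(r"(?<=[.!?])\s+", text))

print(naive_split("X! X? X! X."))         # ('X!', 'X?', 'X!', 'X.')
print(naive_split("Mr. Smith arrived."))  # ('Mr.', 'Smith arrived.') - wrong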

I suggest you use NLTK - it is a suite of open-source Python modules designed for natural language processing.

You can read about sentence segmentation using NLTK here and decide for yourself whether this tool fits your needs.

EDITED: or, even simpler, here; and here is the source code. This is the Punkt sentence tokenizer, included in NLTK.
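
For reference, basic usage of the NLTK tokenizer looks roughly like this (a sketch; it assumes NLTK is installed and that the Punkt models can be downloaded):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-off download of the Punkt models

text = "Mr. Smith arrived. Did he stay? He did!"
print(sent_tokenize(text))
# Punkt is trained to recognise many dotted abbreviations, so "Mr." does not
# end a sentence here.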

Ido.Co
  • I'm using Python 3 and NLTK isn't built for this yet. I already have a large list of abbreviations and I believe I can handle this issue at an earlier stage. – Baz Aug 25 '11 at 10:35
  • Hmmm... maybe you can use the Punkt source code and adjust it to Python 3? On second thought, that would take a lot of work... – Ido.Co Aug 25 '11 at 10:43
  • Links in the `EDITED` section are dead. – Justin D. Apr 29 '15 at 16:47