5

My program takes a text file and splits it into a list of sentences using split('.'). That is, it splits whenever it encounters a full stop, but this can be inaccurate.

For Example

s = 'i love carpets. In fact i own 2.4 km of the stuff.'

Output

listOfSentences = ['i love carpets', ' In fact i own 2', '4 km of the stuff', '']

Desired Output

listOfSentences = ['i love carpets', 'In fact i own 2.4 km of the stuff']

My question is: how do I split at the ends of sentences, rather than at every full stop?

Marko
  • listOfSentences = file.split(".") – Marko Sep 23 '15 at 20:30
  • Splitting into sentences is a non-trivial task. Maybe you can try the Natural Language Toolkit. [Link](http://stackoverflow.com/questions/4576077/python-split-text-on-sentences) to a similar question. – Estiny Sep 23 '15 at 20:31
  • Indeed, also consider abbreviations, e.g. like this one. Tokenisation and sentence splitting is quite an interesting, albeit under-appreciated, task in NLP. NLTK surely has tokenisation and sentence splitting functions. For a specialized solution you can also consider using _ucto_ with python-ucto (https://github.com/proycon/ucto , https://github.com/proycon/python-ucto), which can tokenize and sentence-split various languages. [_disclaimer_: I am the author of ucto] – proycon Sep 23 '15 at 21:22

5 Answers

3

Any regex-based approach cannot handle cases like "I saw Mr. Smith.", and adding hacks for those cases is not scalable. As user est has commented, any serious implementation uses data.

If you only need to handle English, then spaCy is better than NLTK:

import spacy

# modern spaCy API; older versions used `from spacy.en import English`
en = spacy.load('en_core_web_sm')
doc = en(u'i love carpets. In fact i own 2.4 km of the stuff.')
for s in doc.sents:
    print(s.text)

Update: spaCy now supports many languages.

Adam Bittlingmayer
  • AFAIK there are no up-to-date quantitative evaluations of sentence splitting. Regarding your statement that spaCy is better than NLTK for English: I just experienced the opposite, i.e., while I have switched to spaCy almost completely, I found that its sentence-splitting performance is not as good as that of NLTK's punkt for English news articles. – pedjjj Dec 12 '19 at 17:25
  • You may be right, I can't remember why exactly I asserted that, even though I would still bet it's true. – Adam Bittlingmayer Dec 12 '19 at 18:45
0

I found https://github.com/fnl/syntok/ to be quite good, actually the best of all the popular ones. Specifically, I tested nltk (punkt), spacy, and syntok on English news articles.

import syntok.segmenter as segmenter

document = "some text. some more text"

for paragraph in segmenter.analyze(document):
    for sentence in paragraph:
        for token in sentence:
            # exactly reproduce the input
            # and do not remove "imperfections"
            print(token.spacing, token.value, sep='', end='')
    print("\n")  # reinsert paragraph separators
pedjjj
-1

Not splitting at numbers can be done with the split function of the re module:

>>> import re
>>> s = 'i love carpets. In fact i own 2.4 km of the stuff.'
>>> re.split(r'\.[^0-9]', s)
['i love carpets', 'In fact i own 2.4 km of the stuff.']
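One caveat: `[^0-9]` consumes the character after the full stop, so when a sentence boundary has no space after the dot, the first letter of the next sentence is silently dropped. A zero-width lookahead avoids this; a quick sketch (the sample string is mine):

```python
import re

s = 'End of one.Start of next. He owns 2.4 km.'

# the character class consumes the 'S' after the first dot:
eaten = re.split(r'\.[^0-9]', s)
print(eaten)  # → ['End of one', 'tart of next', 'He owns 2.4 km.']

# a zero-width lookahead leaves the following character in place:
kept = re.split(r'\.(?=[^0-9])', s)
print(kept)   # → ['End of one', 'Start of next', ' He owns 2.4 km.']
```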
halfr
-2

The simplest way is to split on a dot followed by a space:

>>> s = 'i love carpets. In fact i own 2.4 km of the stuff.'
>>> s.split('. ')
['i love carpets', 'In fact i own 2.4 km of the stuff.']
Irshad Bhat
  • okay but what if there was a case where there wasn't a space after, for example: `population of 142,100,.[2] falling to 142,065 at the 2011 Census` the [2] stops this from working – Marko Sep 23 '15 at 20:34
  • Also what about abbreviations followed by a dot? And, for example, question marks, exclamation marks, etc.? – Estiny Sep 23 '15 at 20:36
  • You are looking for http://stackoverflow.com/questions/4576077/python-split-text-on-sentences – Irshad Bhat Sep 23 '15 at 21:10
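To make the objection in the comments concrete, a quick sketch showing how an abbreviation breaks the `'. '` split (the example sentence is mine):

```python
s = 'I saw Mr. Smith. He waved.'
parts = s.split('. ')
# the abbreviation 'Mr.' is treated as a sentence end too:
print(parts)  # → ['I saw Mr', 'Smith', 'He waved.']
```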
-2

If you have sentences ending with both "." and ". ", you can try a regex:

import re

text = "your text here. i.e. something."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

source: Python - RegEx for splitting text into sentences (sentence-tokenizing)
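For illustration, a quick sketch of that regex on a string containing an abbreviation, a decimal number, and a question mark (the sample text is mine):

```python
import re

# split at whitespace preceded by '.' or '?', unless the dot belongs to
# an in-word abbreviation like 'i.e.' or a title like 'Mr.'
text = 'i love carpets, i.e. 2.4 km of it. Mr. Smith disagrees. Do you?'
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
print(sentences)
# → ['i love carpets, i.e. 2.4 km of it.', 'Mr. Smith disagrees.', 'Do you?']
```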

isamert