30

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this, so it needs to split on a period at the end of a sentence and not at decimals, abbreviations, titles of names, or when the sentence contains a .com. This is my attempt at a regex that doesn't work:

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
    print(stuff)

Example of what the output should look like:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. 
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
user3590149
    Barrel full of monkeys: even if not using "NLTK", do use something a bit more appropriate than a single regular expression. – user2864740 Sep 09 '14 at 02:06
  • 1
    Any particular reason why you don't want to use NLTK? This is [exactly what it does](http://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python), among other things. There is also [this](https://github.com/fnl/sentence_splitter) that you can take a look at, it's a small library again doing this (and in fact, doing it with regexps). – Amadan Sep 09 '14 at 02:10
  • Parsing natural human language and human-composed text is very, very hard for computers and there are many subtleties. Why don't you want to use NLTK which is designed exactly for this kind of problem? – Dan Lenski Sep 09 '14 at 02:20
  • 2
    The basic NLTK `tokenize.sent_tokenize()` is pretty brutal. See my answer for a truckload of things it gets wrong. Don't disrespect the OP or the question, this is actually seriously non-trivial and interesting, and a topic of active research. – smci Sep 09 '14 at 02:24
  • @smci: All you say is correct; NLTK will get stuff wrong. It is still better than a raw regexp. It is worse than specialised stuff like GeniaSS (for which you'd have to go outside Python). Which is yet again light years away from an actual human. I'm not disrespecting the OP, but if he were aware of the issues you speak of, he would not have demanded a regexp. – Amadan Sep 09 '14 at 02:33
  • I can't use NLTK because I don't have administrative access to install NLTK! If anyone has another solution, I'm open to it? – user3590149 Sep 09 '14 at 02:51
  • 1
    @user3590149 try virtualenv; this lets you create a sandboxed Python environment in which you can install whatever packages you like – ben author Sep 09 '14 at 04:11
  • @Amadan: yes, but to clarify my point: **Distrust all out-of-the-box solutions, they all suck. Generally you have to brew your own application-specific, language-specific sentence-tokenizer**. And even then you only get as much asymptotic accuracy as you're prepared to invest time and money into. The OP has asked a simple-seeming question with a complex answer. – smci Sep 09 '14 at 04:19
  • seconding @benauthor, this is an 'XY problem': OP you're trying to reinvent the wheel, because you refuse to install virtualenv, which hugely simplifies package and tool admin. You're solving the wrong problem with a non-scalable hack, and creating technical debt, ultimately this causes projects to crash and burn. Delegate NLTK stuff to NLTK itself, unless you find a showstopper with no workaround. (Also, if you ever do find an error case where NLTK gets it wrong, please submit a bug on the NLTK issue tracker, so someone in the community can fix it.) – smci Aug 22 '17 at 10:36

10 Answers

50
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this to split your string. You can also check the demo:

http://regex101.com/r/nG1gU7/27
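For example, dropped into re.split with the question's sample text (the lookbehinds are all fixed-width, so Python's built-in re module accepts them):

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
# Split on whitespace that follows '.' or '?', except where the context looks
# like a decimal/domain/i.e.-style token (\w\.\w.) or a title such as Mr. or Jr.
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
for s in sentences:
    print(s)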

vks
35

Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)

In general you can't rely on one single Great White infallible regex; you have to write a function which uses several regexes (both positive and negative), plus a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.

To illustrate how easily this can get seriously complicated, let's try to write a functional spec for a deterministic tokenizer, just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence or something else:

function isEndOfSentence(leftContext, rightContext)

  1. Return False for decimals inside numbers or currency, e.g. 1.23, $1.23, "That's just my $.02". Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
  2. Return False (and don't tokenize into individual letters) for known abbreviations, e.g. "U.S. stocks are falling"; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
  3. Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example proving the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
  4. You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff, esp. on Twitter. (Making that adaptive is even harder.) How do we tell if @midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
  5. After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus (e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments etc.)) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book, or the Coursera course [a]. Ultimately you get as much correctness as you are prepared to pay for.
  6. All of the above is clearly specific to the English language/abbreviations and US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition; you'll need corpora, native-speaking people to label and QA it all, etc.
  7. All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training set must necessarily be either much bigger or much sparser).

In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
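To make the deterministic case concrete, here is a minimal sketch of steps 1, 2 and 5 above. The convention is that leftContext is the text before the candidate '.' and rightContext is the text after it; the ABBREVIATIONS set is a hypothetical placeholder, since as step 2 says, a real dictionary has to be corpus-specific:

import re

# Hypothetical, hand-made abbreviation set (step 2); extend per corpus/domain.
ABBREVIATIONS = {'mr.', 'mrs.', 'dr.', 'jr.', 'sr.', 'i.e.', 'e.g.', 'u.s.'}

def isEndOfSentence(leftContext, rightContext):
    # Step 1: digits on both sides of the '.' -> decimal, section ref, IP, date...
    if re.search(r'\d$', leftContext) and re.search(r'^\d', rightContext):
        return False
    # Step 2: the last token on the left (with its '.' re-attached) is a known abbreviation
    tokens = leftContext.split()
    if tokens and (tokens[-1].lower() + '.') in ABBREVIATIONS:
        return False
    # Step 5: after the negative cases, end-of-text or whitespace followed by an
    # (optionally quoted) capital letter is arbitrarily called end-of-sentence
    return rightContext == '' or bool(re.match(r'\s+["\'(]?[A-Z]', rightContext))

In the probabilistic version, each of these checks would instead contribute a weighted feature, and the function would return the 0.0-1.0 confidence rather than short-circuiting to True/False.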

References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]

smci
  • @bootsmaat: the Jurafsky video is good but only covers deterministic, not probabilistic, tokenizing. A real-world approach should be probabilistic. – smci Dec 18 '17 at 23:05
  • @smci can you update the video link, please – Sundeep Pidugu Dec 06 '18 at 12:53
  • @Sundeep: sad to see that superb video was taken down due to copyright. I can't find the Jurafsky lecture 2-5 video online. If you see any unofficial versions on YouTube, please let us know. – smci Dec 06 '18 at 18:57
  • 1
    Just wanted to say thanks for writing out a relatively thorough list of things to look out for! I need to implement this in a different language and your list is the most comprehensive one I've seen! – rococo Feb 07 '20 at 22:59
  • 1
    @rococo: Sure thing. In any case in recent decades tokenizing in NLP has moved heavily away from crisp rules-based and towards a probabilistic, context-specific, ephemeral thing which we learn using ML. Esp. when you need to handle incomplete, ungrammatical and/or wrongly-punctuated, multilanguage, slang, acronyms, emoticons, emoji, Unicode... the target keeps evolving. – smci Feb 07 '20 at 23:09
6

Try splitting the input on the spaces rather than on a dot or ?; that way the dot or ? stays attached to the sentence instead of being consumed by the split. The lookbehind (?<=[^A-Z].[.?]) requires a . or ? where the character three positions back is not an uppercase letter (which protects two-letter titles like Mr. and Jr.), and the lookahead (?=[A-Z]) only splits where the next sentence starts with a capital letter (which protects lowercase continuations like "he paid" after i.e.).

>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
... 
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
Avinash Raj
  • Works really well on well-formatted text (i.e. all sentences must begin with a capital letter) – Iulius Curt Mar 15 '16 at 13:36
  • I like this solution the most. It will actually format it correctly in all the cases I've tried. I just added an exclamation mark to it. `(?<=[^A-Z].[.?!]) +(?=[A-Z])` – Ste Nov 22 '19 at 22:54
  • Can somebody explain the above regex – scv Jan 23 '20 at 22:00
2
import re

sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)

Here the regex used is: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)

First block: (?<!\w\.\w.): a negative lookbehind ((?<!)) that rejects a split point preceded by a word character (\w), a period (\.), another word character (\w) and any character (.); this protects abbreviations such as i.e.

Second block: (?<![A-Z][a-z]\.): a negative lookbehind that rejects a split point preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]) and a dot (\.); this protects titles such as Mr. and Jr.

Third block: (?<=\.|\?): a positive lookbehind requiring a dot (\.) OR a question mark (\?) immediately before the split point.

Fourth block: (\s|[A-Z].*): this matches after the dot OR question mark from the third block: either whitespace (\s) or a sequence of characters starting with an uppercase letter ([A-Z].*). This block is important to split even if the input is like

Hello world.Hi I am here today.

i.e. if there is space or no space after the dot.
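A quick check of that no-space case, as a sketch. Note that because (\s|[A-Z].*) is a capturing group, re.split also returns the captured separator as its own list element, which is what rescues the text that [A-Z].* swallows:

import re

rgx = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)'
print(re.split(rgx, 'Hello world.Hi I am here today.'))
# -> ['Hello world.', 'Hi I am here today.', '']
print(re.split(rgx, 'One two. Three four? Five.'))
# -> ['One two.', ' ', 'Three four?', ' ', 'Five.']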

1

A naive approach for proper English sentences that don't start with non-alphabetic characters and don't contain quoted parts of speech:

import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')   # terminator plus following whitespace, captured so split() keeps it
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')   # abbreviations whose trailing '.' does not end a sentence
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
  # flush the accumulated sentence once its last piece is a terminator
  # and the text so far does not end in a known abbreviation
  if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
    print(''.join(sentence))
    sentence = []
  if len(part):
    sentence.append(part)
if len(sentence):
  print(''.join(sentence))

False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.

You will never reach perfection with this approach. But depending on the task it might just work "enough"...

Oktokolo
1

I'm not great at regular expressions, but here is a simpler, "brute force" version of the above:

sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")

which means the acceptable starting units are '[A-Z], "[A-Z], a title like ([A-Z][a-z]*\. ), or a bare [A-Z].

Please note that most regular expressions are greedy, so the order is very important when we use | (or). That's why I have written the i.e.-style regular expression first, and then come forms like Inc.

Priyank Pathak
0

Try this:

(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
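Note that Python's built-in re module rejects this pattern ("look-behind requires fixed-width pattern"), because the look-behind alternatives differ in width ([A-Z][a-z] is two characters, \d and [i.e] are one). A minimal sketch, assuming the third-party regex package, which allows variable-width look-behinds:

import regex  # pip install regex; the built-in re rejects this look-behind

pattern = r'(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)'
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it."
# Splits only at the sentence-final '.'; the dots in Mr., cheapsite.com,
# 1.5 and i.e. are all excluded by the look-around conditions.
print(regex.split(pattern, text))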
walid toumi
0

I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph.lower(), contraction))  # lower-case the text so 'Mr.' matches the 'mr.' key
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
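For illustration, a usage sketch on part of the question's sample text (this assumes the lower-casing fix commented in find_sentence_end above, so that 'Mr.' is matched against the lower-case 'mr.' key):

paragraph = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
             "i.e. he paid a lot for it. Did he mind? "
             "Adam Jones Jr. thinks he didn't.")
for sentence in find_sentences(paragraph):
    print(sentence)
# Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
# Did he mind?
# Adam Jones Jr. thinks he didn't.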

TennisVisuals
0

My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.

import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
               'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
               'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
               'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
               'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
               '[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']

ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format(r'|\s'.join(ABREVIACOES)), re.IGNORECASE)

def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []

        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas

Full code at: https://github.com/luizanisio/comparador_elastic

-2

If you want to break up sentences at 3 periods (not sure if this is what you want), you can use this regular expression:

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)

for stuff in sentences:
    print(stuff)
Jose Varez