Identify Sentences in Text

Question

I am having some trouble correctly identifying sentences in a text for specific corner cases:

If an ellipsis (dot, dot, dot) is involved, it is not kept.
If quotation marks (") are involved.
If a sentence accidentally starts with a lower-case letter.

This is how I identify sentences in text so far (source: Subtitles Reformat to end with complete sentence):

The re.findall part basically looks for a chunk of the string that starts with a capital letter, [A-Z], continues with anything except the punctuation, and ends with one punctuation character, [\.?!].

import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Next, we also determined the size of the population.

Corner Case 1: Dot, Dot, Dot

Only the first of the three dots is kept, since the pattern stops at the first punctuation character and has no rule for what to do when three dots appear in a row. How could this be changed?

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Next, we also determined the size of the population.

Corner Case 2: Quotation marks (")

The " symbol is kept when it appears inside a sentence, but, like the extra dots following the punctuation, it is dropped when it comes right after the closing punctuation.

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

Corner Case 3: lower case start of a sentence

If a sentence accidentally starts with a lower-case letter, the sentence is ignored. The aim would be to recognise that the previous sentence has ended (or that the text has just started) and hence that a new sentence has to start.

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Edit

I tested the spaCy approach suggested in one of the answers:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

...but I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the
dependency parser, or set sentence boundaries by setting
doc[i].is_sent_start.

Good idea. Might do that, if no other options are available. – henry, May 17 '19 at 07:59
Not about your corner cases, but a general thought: maybe you can just split your text into sentences using the indicator `. `, a single dot followed by a space and not preceded by other dots? If at least this were a common factor, all other things like quotation marks could be ignored, but I'm just guessing. To create a regex matching a dot not preceded by other specified characters, see: https://www.regular-expressions.info/lookaround.html – xph, May 17 '19 at 08:00
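
The lookaround idea from the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name split_on_single_dot is made up, and it assumes sentences are separated by whitespace after a single dot. The lookbehind accepts a dot only when the character before it is not a dot, so an ellipsis does not trigger a split.

```python
import re

def split_on_single_dot(text):
    """Split text at whitespace that follows a single dot (hypothetical helper)."""
    # (?<=[^.]\.) is a fixed-width lookbehind: a non-dot character, then a dot.
    # Lookbehinds consume nothing, so the dot stays attached to its sentence.
    return re.split(r'(?<=[^.]\.)\s+', text)

text = "We answered the first question. next, we measured the population... It was large."
parts = split_on_single_dot(text)
for part in parts:
    print(part)
```

Because this splits instead of matching, it does not matter whether a sentence starts with a capital letter. As written it ignores ? and ! and leaves an ellipsis inside the sentence, so it is a starting point rather than a full solution.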

BlueSheepToken · Accepted Answer · 2019-05-17T09:08:28.827

You could modify your regex to match your corner cases.

First of all, you do not need to escape . inside [].

For the first corner case, you can greedily match the sentence-ending characters with [.!?]*.

For the second, you can optionally match " after [.!?].

For the last one, you can start your sentence with either an upper-case or a lower-case letter:

import re

# Start at a letter, run up to the first sentence-ending character,
# then take every consecutive ending character and an optional closing quote.
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

Explanation

[A-Za-z]: every match should start with a letter, either upper or lower case. (The shorter range [A-z] would also match the six ASCII characters that sit between Z and a: [ \ ] ^ _ and the backtick.)
[^.?!]*: greedily matches any character that is not ., ? or ! (a sentence-ending character).
[.?!]*: greedily matches the ending characters, so ...??!!??? will be matched as part of the sentence.
"?: optionally matches a quote at the end of the sentence.

Corner case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Corner case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Corner case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.
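
A side note on the letter class at the start of the pattern, as a small check and not part of the answer itself: a range written as [A-z] spans ASCII codes 65 to 122, which also takes in [ \ ] ^ _ and the backtick, whereas [A-Za-z] matches letters only. The sample string below is made up for the demonstration.

```python
import re

# The eight consecutive ASCII characters from 'Z' (90) to 'a' (97).
sample = "Z[\\]^_`a"

loose = re.findall(r'[A-z]', sample)      # one range, punctuation included
strict = re.findall(r'[A-Za-z]', sample)  # two ranges, letters only

print(loose)
print(strict)
```

For the three example texts the difference does not show, since none of them has one of those six characters at the start of a sentence.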

Great answer! This is what I was looking for. Just a quick question: what do you mean by "it matches greedily"? – henry, May 17 '19 at 08:14
It means the quantifier takes as many characters as it can, so all of `...` is matched. A non-greedy match with `*?` takes as few as possible and will not match `...`. – BlueSheepToken, May 17 '19 at 08:45
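
The greedy versus non-greedy difference from the comment above can be seen in a short sketch. The sample text is invented for the illustration, and the patterns are trimmed-down variants of the answer's regex (no capture group, no trailing quote).

```python
import re

text = "Wait... What?!"

# Greedy: [.!?]* grabs every consecutive ending character.
greedy = re.findall(r'[A-Za-z][^.!?]*[.!?]*', text)

# Non-greedy: [.!?]*? takes as few as possible, which is none here,
# because nothing after it in the pattern forces it to take more.
lazy = re.findall(r'[A-Za-z][^.!?]*[.!?]*?', text)

print(greedy)
print(lazy)
```

The greedy version keeps the punctuation with each sentence, while the non-greedy one stops right before it.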

Novak · Answer 2 · 2019-05-17T08:33:25.843

You can use one of the industrial-strength packages for that. For example, spaCy has a very good sentence tokenizer.

import spacy

raw_text = 'Hello, world. Here are two sentences.'
# Assumes a recent spaCy with the small English model installed
# (python -m spacy download en_core_web_sm); its parser sets the sentence boundaries.
nlp = spacy.load('en_core_web_sm')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]

Your scenarios:

Case 1 result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

edited May 17 '19 at 08:33

answered May 17 '19 at 08:01

Thanks for your answer. Is this free? – henry May 17 '19 at 08:12
Yes, it is. No problem :) – Novak May 17 '19 at 08:13
Would you mind trying the three examples from my question and posting the results in your answer? – henry May 17 '19 at 08:15
Would you mind having a look at my updated question? I tested your method, but I keep getting an error. – henry May 17 '19 at 09:25
I haven't seen that error. Try going to the spaCy page and download everything that you need (like neg dictionary etc.). That should solve your problem. – Novak May 20 '19 at 09:05

score 0 · Answer 3 · answered May 17 '19 at 08:04

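Try this regex: ([A-Z][^.!?]*[.!?]+["]?)

'+' means one or more

'?' means zero or one

This should handle the first two corner cases you mentioned above. The third one (a lower-case start) is still missed, because the pattern requires [A-Z] as the first character; [A-Za-z] would cover it.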
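
As a quick check of this pattern against the three texts from the question (a sketch only; the dictionary keys are labels made up here), the first two inputs come back as two sentences each, while the third loses its second sentence because the pattern still requires an upper-case first letter.

```python
import re

pattern = r'([A-Z][^.!?]*[.!?]+["]?)'

# The three example texts from the question, keyed by made-up labels.
cases = {
    "ellipsis": "We were able to respond to the first research question... "
                "Next, we also determined the size of the population.",
    "quotes": 'We were able to respond to the first "research" question: '
              '"What is this?" Next, we also determined the size of the population.',
    "lower case": "We were able to respond to the first research question. "
                  "next, we also determined the size of the population.",
}

results = {name: re.findall(pattern, text) for name, text in cases.items()}
for name, sentences in results.items():
    print(name, len(sentences), sentences)
```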

boedrs
