How to count sentences taking into account the occurrence of ellipses

Question

I've written the following script to count the number of sentences in a text file:

import re

filepath = 'sample_text_with_ellipsis.txt'

with open(filepath, 'r') as f:
    read_data = f.read()

sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)

However, if I run it on a sample_text_with_ellipsis.txt with the following content:

Wait for it... awesome!

I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").

What I tried to do in the regex is to make it match only a single period by writing .{1}, but this apparently doesn't work the way I intended. How can I get the regex to ignore ellipses?

Is `Wait for it... Awesome!` to be considered one sentence or two? — Jongware, Jul 26 '16 at 11:47

score 4 · Accepted Answer · edited May 23 '17 at 11:58


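Splitting text into sentences with a regex like this is not reliable in general. See "Python split text on sentences" for how NLTK can be used for this.

To answer your question: you treat a sequence of three dots as an ellipsis, so a dot should count as a sentence terminator only when it is not adjacent to another dot. Thus, you need to use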

[!?]+|(?<!\.)\.(?!\.)

See the regex demo. The . is moved out of the character class because quantifiers cannot be used inside a character class, and a . is matched only when it is not adjacent to another dot.

[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
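To illustrate why the character class from the question misbehaves, here is a minimal sketch (the variable name `original` and the sample strings are made up for illustration). Inside `[...]`, the characters `{`, `1`, and `}` are plain literals, so they become delimiters as well, and the trailing `+` lets a run of dots be consumed as a single delimiter:

```python
import re

# The pattern from the question.
original = r'[.{1}!?]+'

# Inside a character class, '{', '1' and '}' are ordinary characters,
# so they act as delimiters too.
print(re.split(original, "a{b1c}d"))
# ['a', 'b', 'c', 'd']

# The '+' applies to the whole class, so "..." is consumed as one delimiter.
print(re.split(original, "Wait for it... awesome!"))
# ['Wait for it', ' awesome', '']
```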

See Python demo:

import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count)  # => 1
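If the text may not end with a terminator, dropping the last element with `[:-1]` would discard a real sentence. One possible variation, sketched here with the hypothetical names `TERMINATOR` and `count_sentences` (not part of the original answer), is to count only the non-empty chunks:

```python
import re

# Sentence terminators: runs of ! or ?, or a single dot with no dot next to it.
TERMINATOR = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')

def count_sentences(text):
    """Count the non-empty chunks left after splitting on terminators."""
    chunks = TERMINATOR.split(text.replace('\n', ''))
    return sum(1 for chunk in chunks if chunk.strip())

print(count_sentences("Wait for it... awesome!"))      # 1
print(count_sentences("Is it done? Yes. Wait... no"))  # 3
```

Filtering out empty chunks means a final sentence without closing punctuation is still counted, which may or may not be what you want for your data.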

edited May 23 '17 at 11:58 by Community

answered Jul 26 '16 at 11:49 by Wiktor Stribiżew

Hello, what do you do when a sentence contains 'number.number' and you don't want to split it? – jos97 Aug 23 '21 at 09:21

@jos97 This is an old answer; nowadays I'd use spaCy. To avoid matching a dot between two digits, you can use `\.(?!(?<=\d.)\d)`. So, the pattern above turns into `r'[!?]+|(?<!\.)\.(?!(?<=\d.)\d)(?!\.)'` – Wiktor Stribiżew Aug 23 '21 at 09:26
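The pattern from the comment above can be tried out as follows. This is only a sketch: the sample string and the variable names are made up for illustration.

```python
import re

# A lone dot is a terminator unless it sits between two digits (as in "3.14")
# or has another dot next to it (as in "...").
pattern = r'[!?]+|(?<!\.)\.(?!(?<=\d.)\d)(?!\.)'

text = "Pi is about 3.14. Wait for it... awesome!"

# Drop the empty string that re.split leaves after the final terminator.
sentences = [s for s in re.split(pattern, text) if s.strip()]

print(sentences)       # ['Pi is about 3.14', ' Wait for it... awesome']
print(len(sentences))  # 2
```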

score 0 · Answer 2 · answered Jul 26 '16 at 15:03

Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:

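import nltk

# sent_tokenize relies on NLTK's Punkt tokenizer data, which must be installed beforehand with nltk.download()
read_data = "Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))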

This yields a sentence count of 1 as expected.
