I've written the following script to count the number of sentences in a text file:

import re

filepath = 'sample_text_with_ellipsis.txt'

with open(filepath, 'r') as f:
    read_data = f.read()

sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)

However, if I run it on a sample_text_with_ellipsis.txt with the following content:

Wait for it... awesome!

I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").

What I tried to do in the regex was to match exactly one period with .{1}, but this apparently doesn't work the way I intended. How can I get the regex to ignore ellipses?

Kurt Peek

2 Answers

Splitting sentences with a regex like this is not robust. See "Python split text on sentences" for how NLTK can be leveraged for this.

To answer your question: you call a sequence of three dots an ellipsis, so a dot should count as a sentence delimiter only when it is not adjacent to another dot. For that, you need to use

[!?]+|(?<!\.)\.(?!\.)

See the regex demo. The . is moved out of the character class because quantifiers cannot be used inside one (there, {, 1 and } are just literal characters), so the pattern matches only a . that has no dot immediately before or after it.

  • [!?]+ - 1 or more ! or ?
  • | - or
  • (?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
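
To make the point about the character class concrete, here is a small sketch of how the pattern from the question behaves. The sample string 'a{b}c1d' and the variable names are made up for illustration only.

```python
import re

original = r'[.{1}!?]+'  # pattern from the question

# Inside [...] the characters {, 1 and } are ordinary set members,
# so this class also splits on braces and on the digit 1.
braces = re.split(original, 'a{b}c1d')
print(braces)    # ['a', 'b', 'c', 'd']

# The trailing + treats "..." as a single delimiter, so the ellipsis
# still splits the text into two pieces (plus a trailing empty string).
ellipsis = re.split(original, 'Wait for it... awesome!')
print(ellipsis)  # ['Wait for it', ' awesome', '']
```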

See the Python demo:

import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count)  # => 1
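
If the count is needed in more than one place, the same pattern could be wrapped in a small helper. This is only a sketch: the name count_sentences is hypothetical, and it deliberately differs from the demo above by skipping empty or whitespace-only pieces instead of dropping the last element, so text without a final terminator still counts as one sentence.

```python
import re

# Same delimiter pattern as in the answer: runs of ! or ?, or a lone dot.
SENTENCE_END = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')

def count_sentences(text):
    """Count the non-blank pieces between delimiters (hypothetical helper)."""
    parts = SENTENCE_END.split(text)
    # Whitespace-only pieces (including bare newlines) are not sentences.
    return sum(1 for part in parts if part.strip())

print(count_sentences("Wait for it... awesome!"))  # 1
print(count_sentences("One. Two! Three?"))         # 3
print(count_sentences("No terminator here"))       # 1
```

With sentences[:-1] the last example would give 0, which may or may not be what you want; pick whichever rule matches your data.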
Wiktor Stribiżew
  • Hello, how do you handle a sentence that contains 'number.number' when you don't want to split it? – jos97 Aug 23 '21 at 09:21
  • @jos97 This is an old answer; now I'd use spaCy. To avoid matching a dot between two digits, you can use `\.(?!(?<=\d.)\d)`. So, the pattern above turns into `r'[!?]+|(?<!\.)\.(?!(?<=\d.)\d)(?!\.)'` – Wiktor Stribiżew Aug 23 '21 at 09:26
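
A quick sketch of how the digit-aware pattern from this comment behaves. The sample text and variable names are made up for illustration, and the empty trailing piece is filtered out rather than sliced off.

```python
import re

# Pattern from the comment: a lone dot, but not one sitting between two digits.
pattern = re.compile(r'[!?]+|(?<!\.)\.(?!(?<=\d.)\d)(?!\.)')

text = "Pi is about 3.14. Wait for it... done!"
parts = [p for p in pattern.split(text) if p.strip()]
print(parts)       # ['Pi is about 3.14', ' Wait for it... done']
print(len(parts))  # 2
```

The dot in 3.14 is skipped because it has a digit on both sides, while the dot after 14 still ends the sentence.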

Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:

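import nltk
# sent_tokenize needs NLTK's Punkt tokenizer data to be installed first,
# e.g. via nltk.download('punkt') (newer NLTK releases use 'punkt_tab').
read_data = "Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))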

This yields a sentence count of 1 as expected.

Kurt Peek