8

So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you Tom?", "I am Danny."]

That is, I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?

I am aware of these questions:

JS string.split() without removing the delimiters

Python split() without removing the delimiter

But my problem involves several delimiters (., ?, !), which complicates things.

wjandrea
GA1

5 Answers

20

You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches instead of running on to the last one:

import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
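The matches above keep the space that preceded each sentence. A minimal sketch of one way to drop it, assuming surrounding whitespace is not wanted (the name `sentences` is only illustrative):

```python
import re

s = "You! Are you Tom? I am Danny."

# Lazily match up to each delimiter, then strip the whitespace that
# was left over between sentences.
sentences = [match.strip() for match in re.findall(r'.*?[.!?]', s)]
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```

Note that any trailing text without a closing delimiter is not matched by this pattern and so does not appear in the result.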
Psidom
  • A minor plus: if the substrings contain newlines (`\n`) that we want to keep, we can pass `flags=re.DOTALL` to `findall()` so that `.` also matches `\n`. – Lucecpkn Mar 16 '22 at 16:42
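To illustrate the comment above, a short sketch comparing the same pattern with and without re.DOTALL. The sample `text` and the variable names are made up for the example:

```python
import re

# Hypothetical input in which sentences wrap across lines.
text = "Line one\ncontinues here. Second\nsentence!"

# By default "." does not match "\n", so a match cannot cross a line break
# and the text before each newline is skipped.
per_line = re.findall(r'.*?[.!?]', text)
print(per_line)  # ['continues here.', 'sentence!']

# With re.DOTALL, "." also matches "\n", so whole sentences are kept.
across_lines = re.findall(r'.*?[.!?]', text, flags=re.DOTALL)
print(across_lines)  # ['Line one\ncontinues here.', ' Second\nsentence!']
```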
8

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:

>>> import re
>>> re.split(r'(?<=[.!?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']

This splits on whitespace, but only if it is preceded by either a ., !, or ? character.

Ruud de Jong
7

Since Python 3.7, re.split can split on zero-length matches, so you can achieve this by matching an empty string preceded by one of the delimiters:

(?<=[.!?])

Demo: https://regex101.com/r/ZLDXr1/1

Earlier Python versions do not support splitting on zero-length matches, although the pattern may still be useful in other languages that support lookbehinds. Note that each sentence keeps its leading space, and a text ending in a delimiter produces a trailing empty string.

However, based on your input/output samples, you actually need to split on the whitespace preceded by one of the delimiters. So the regex would be:

(?<=[.!?])\s+

Demo: https://regex101.com/r/ZLDXr1/2

Python demo: https://ideone.com/z6nZi5

If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
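For comparison, a small sketch of the two lookbehind patterns above passed to re.split. It assumes Python 3.7 or later, where re.split also splits on patterns that can match an empty string; the variable names are only illustrative:

```python
import re

s = "You! Are you Tom? I am Danny."

# Zero-length split: cut right after each delimiter. The whitespace stays
# attached to the next sentence, and the final "." yields an empty tail.
zero_width = re.split(r'(?<=[.!?])', s)
print(zero_width)  # ['You!', ' Are you Tom?', ' I am Danny.', '']

# Split on the whitespace that follows a delimiter; the whitespace is
# consumed and the delimiters are kept.
on_whitespace = re.split(r'(?<=[.!?])\s+', s)
print(on_whitespace)  # ['You!', 'Are you Tom?', 'I am Danny.']
```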

Dmitry Egorov
0

If you prefer to use split rather than match, one solution is to split with a capturing group:

splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))

filter(None, ...) removes any empty strings.

This works even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis (or has none at all).

It is even possible to keep your regex as is (with the escaping corrected and parentheses added):

splitted = list(filter(None, re.split(r'([.!?])', s)))

Then merge each even-indexed element with the odd-indexed element that follows it, and strip the extra spaces.

Python split() without removing the delimiter
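The merge step described above is not spelled out in the answer. One possible sketch, assuming the capturing-group split is used without the filter so that text and delimiters strictly alternate (`parts`, `pairs`, and `sentences` are illustrative names):

```python
import re
from itertools import zip_longest

s = "You! Are you Tom? I am Danny."

# The capturing group keeps the delimiters, so the result alternates
# text, delimiter, text, delimiter, ...
parts = re.split(r'([.!?])', s)
# ['You', '!', ' Are you Tom', '?', ' I am Danny', '.', '']

# Pair each text chunk (even index) with the delimiter after it (odd index).
# zip_longest keeps a trailing chunk that has no delimiter.
pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')

# Join each pair, strip the extra spaces, and drop empty leftovers.
sentences = [joined for joined in
             ((text + delim).strip() for text, delim in pairs) if joined]
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```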

Serge
0

The easiest way is to use nltk.

import nltk
nltk.sent_tokenize(s)

It will return a list of all your sentences without losing the delimiters.

Amir Imani
  • This works only if you are working with English text. In particular, the question is much more general about any delimiters. Further, as awesome as `nltk` is, blindly relying on it without understanding the inner workings of any library is dangerous. – Hrishikesh Feb 28 '22 at 06:33