Splitting on regex without removing delimiters

Question

So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you Tom?", "I am Danny."]

That is, I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?

I am aware of these questions:

JS string.split() without removing the delimiters

Python split() without removing the delimiter

But my case has several delimiters (., ?, !), which complicates things.

score 20 · Accepted Answer · answered May 29 '17 at 14:17


You can use re.findall with regex .*?[.!\?]; the lazy quantifier *? makes sure each pattern matches up to the specific delimiter you want to match on:

import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)  # raw string; ? needs no escape inside a character class
# ['You!', ' Are you Tom?', ' I am Danny.']


Psidom


A minor plus: if the substrings contain new line (`\n`) that we want to keep, we can pass `flags=re.DOTALL` into the `findall()` function, to make `.` also match `\n`. – Lucecpkn Mar 16 '22 at 16:42
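To illustrate the comment above, here is a minimal sketch assuming a sentence may span a line break. The sample string `text` and the final `strip()` step are additions for illustration, not part of the original answer.

```python
import re

# Hypothetical sample: the second sentence spans two lines.
text = "You! Are you\nTom? I am Danny."

# Without re.DOTALL, "." cannot cross "\n", so the text before the newline is lost.
naive = re.findall(r'.*?[.!?]', text)

# With re.DOTALL, "." also matches "\n", so the sentence stays in one piece.
dotall = re.findall(r'.*?[.!?]', text, flags=re.DOTALL)

# Optional clean-up: drop the leading space that findall keeps on each match.
sentences = [part.strip() for part in dotall]

print(naive)      # ['You!', 'Tom?', ' I am Danny.']
print(sentences)  # ['You!', 'Are you\nTom?', 'I am Danny.']
```

Note that `findall` only returns matches, so any trailing text without a closing delimiter is silently dropped.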

score 8 · Answer 2 · answered May 29 '17 at 14:24


Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:

>>> import re
>>> s = "You! Are you Tom? I am Danny."
>>> re.split(r'(?<=[.!?])\s*', s)  # output as on Python < 3.7; 3.7+ adds a trailing ''
['You!', 'Are you Tom?', 'I am Danny.']

This splits on whitespace, but only if it is preceded by either a ., !, or ? character.


Ruud de Jong


`re.split(r'(?<=[\.\!\?])\s+', s)` to avoid an empty string as the last match. – levsa Oct 31 '22 at 14:48
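A small sketch of the difference between the two patterns, assuming Python 3.7 or later (where `re.split` also splits on empty matches). The variable names are illustrative only.

```python
import re

s = "You! Are you Tom? I am Danny."

# \s* can match the empty string after the final ".", which yields a trailing ''.
with_star = re.split(r'(?<=[.!?])\s*', s)

# \s+ needs at least one whitespace character, so nothing is split at the end.
with_plus = re.split(r'(?<=[.!?])\s+', s)

print(with_star)  # ['You!', 'Are you Tom?', 'I am Danny.', '']
print(with_plus)  # ['You!', 'Are you Tom?', 'I am Danny.']
```

The trade-off is that `\s+` will not split sentences that have no whitespace between them.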

Dmitry Egorov · Answer 3 · 2017-05-29T15:06:58.007

7

If Python supported split by zero-length matches, you could achieve this by matching an empty string preceded by one of the delimiters:

(?<=[.!?])

Demo: https://regex101.com/r/ZLDXr1/1

Unfortunately, Python before version 3.7 does not support splitting on zero-length matches. The solution may still be useful in other languages that support lookbehinds.

However, based on your input/output samples, what you actually need is to split on whitespace preceded by one of the delimiters. So the regex would be:

(?<=[.!?])\s+

Demo: https://regex101.com/r/ZLDXr1/2

Python demo: https://ideone.com/z6nZi5

If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.

edited May 29 '17 at 15:06

answered May 29 '17 at 14:18

Dmitry Egorov


I think I didn't state my problem clearly enough. What if there are no spaces after the `[.\?!]`? – GA1 May 29 '17 at 14:38

Python has support for zero-length matches as of 3.7. – vahvero Nov 19 '19 at 11:22
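Following up on that comment, a minimal sketch of the zero-length lookbehind split, assuming Python 3.7 or later. The input without spaces is a made-up example that matches the case GA1 asks about.

```python
import re

# Hypothetical input with no whitespace between sentences.
s = "You!Are you Tom?I am Danny."

# Split at the empty position right after each delimiter (needs Python 3.7+).
parts = re.split(r'(?<=[.!?])', s)

# The empty match at the end of the string leaves a trailing '', so drop empties.
sentences = [p for p in parts if p]

print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```

Be aware that this splits after every single delimiter, so a run such as "..." ends up as separate one-character pieces.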

Serge · Answer 4 · 2017-05-29T17:15:06.197

If you prefer to use the split method rather than match, one solution is to split with a capturing group:

splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))  # list() because filter is lazy in Python 3

Filter removes empty strings if any.

This works even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis (or has none at all).
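A quick sketch of that claim, using a made-up input with no spaces and an unterminated final sentence. It compares `re.findall` with the capturing-group split; the variable names are illustrative.

```python
import re

# Hypothetical input: no spaces between sentences, last sentence has no delimiter.
s = "You!Are you Tom?I am Danny"

# findall only returns matches, so the unterminated tail is dropped.
found = re.findall(r'.*?[.!?]', s)

# split returns the captured matches plus whatever text lies between and after them.
splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))

print(found)     # ['You!', 'Are you Tom?']
print(splitted)  # ['You!', 'Are you Tom?', 'I am Danny']
```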

It is even possible to keep your regex as is (after fixing the escaping and adding parentheses):

splitted = list(filter(None, re.split(r'([.!?])', s)))  # delimiters come back as separate elements

Then merge the even- and odd-indexed elements and strip the extra spaces:

Python split() without removing the delimiter
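One possible way to do the merge step, as a sketch rather than the method from the linked question. It skips the `filter` call so that text chunks and delimiters stay at alternating positions.

```python
import re

s = "You! Are you Tom? I am Danny."

# The capturing group keeps each delimiter as its own list element:
# ['You', '!', ' Are you Tom', '?', ' I am Danny', '.', '']
pieces = re.split(r'([.!?])', s)

# Pair each text chunk (even index) with the delimiter after it (odd index).
# The last chunk has no delimiter, so pad the delimiter list with ''.
merged = (
    (text + delim).strip()
    for text, delim in zip(pieces[0::2], pieces[1::2] + [''])
)

# Drop anything that is empty after stripping.
sentences = [m for m in merged if m]

print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```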

score 0 · Answer 5 · answered Dec 29 '17 at 17:47


The easiest way is to use nltk.

import nltk
nltk.sent_tokenize(s)

It will return a list of all your sentences without losing the delimiters.


Amir Imani


This works only if you are working with English text. In particular, the question is much more general about any delimiters. Further, as awesome as `nltk` is, blindly relying on it without understanding the inner workings of any library is dangerous. – Hrishikesh Feb 28 '22 at 06:33
