Splitting longer patterns using regex without losing characters Python 3+

Question

My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:

re.split(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)

I need to split the text at the whitespace inside the match. But re.split, as it should, removes the whole matched pattern and not just the whitespace, so the last character of the sentence and the sentence terminator are lost.

"Is this the number 3? The text goes on..."

will look like

"Is this the number " and "he text goes on..."

Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?

Have you considered using lookarounds to find the space on which you actually want to split? — jonrsharpe, May 18 '15 at 13:47
@jonrsharpe: that only works if there is at least one character you capture. — Willem Van Onsem, May 18 '15 at 13:51
Shameless self promotion but: http://stackoverflow.com/questions/29988595/python-regex-splitting-on-pattern-match-that-is-an-empty-string Using the accepted solution there with lookarounds, you can probably succeed. — Shashank, May 18 '15 at 13:52
Oh and alternatively, if you don't want to use the accepted solution there, come up with an alternative regex with lookarounds that matches a full sentence, and use `re.findall` :) Then you will lose no characters in the "split". — Shashank, May 18 '15 at 13:57
@CommuSoft true, I assumed the OP would capture the whitespace they refer to. — jonrsharpe, May 18 '15 at 14:01
Thanks for replies. I like the @Shashank sentence recognition solution, be it with findall or whatever, as i can still keep my patterns. — Andris Leduskrasts, May 18 '15 at 14:43
Is there a reason you need to use that specific pattern, instead of for instance `(?<=[.?!])\s`? See [this sample code](https://glot.io/snippets/e3l5p8snjk) — ohaal, May 18 '15 at 14:56
@ohaal The program I'm working on has to do with language grammar, The language in question (latvian) has specific cases where general-used patterns such as yours doesn't work. (1st being written as 1., so "1. question" is a grammatically correct and some other specifics). The pattern I made is not ultimately perfect, but it works for a high number of cases and the ones they don't are usually stylistically questionable. — Andris Leduskrasts, May 18 '15 at 16:11

score 1 · Accepted Answer · edited May 23 '17 at 10:26

As @jonrsharpe says, you can use a lookaround to reduce the number of characters consumed by the split, for instance to a single one. If you don't mind losing the space characters, you could use something like:

>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']

This splits on any whitespace character that is followed by an uppercase letter. Because a lookahead does not consume what it matches, the T is kept and only the space is lost.
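The same idea can be applied to the pattern from the question. The following is only a sketch, assuming that each alternative can be rewritten so its context sits in a lookbehind or lookahead and only the whitespace is consumed. The name SENTENCE_GAP is made up for illustration, and the `$` alternative is dropped because it never matches whitespace. Python requires fixed-width lookbehinds, which holds here since each one is exactly two characters wide.

```python
import re

# Assumed rewrite of the question's pattern: the context of each alternative
# is moved into lookarounds, so only the whitespace itself is consumed.
SENTENCE_GAP = re.compile(
    r'(?<=\D[.!?])\s(?=[A-Z])'   # non-digit + terminator, then a capital
    r'|(?<=\d[!?])\s'            # digit + "!" or "?"
    r'|(?<=\d[.])\s(?=[A-Z])'    # digit + ".", then a capital
)

content = "Is this the number 3? The text goes on..."
print(SENTENCE_GAP.split(content))
# ['Is this the number 3?', 'The text goes on...']
```

Since every match still consumes one whitespace character, this does not depend on how re.split treats empty matches.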

Alternative approach: alternating split/capture item

You can, however, use another approach. Splitting consumes the matched text, but you can run the same regex through re.findall to get a list of the matches. These matches are exactly the substrings that were removed between the split pieces. By interleaving the matches with the split pieces, you reconstruct the full text:

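import re

def nonconsumesplit(regex, content):
    # Note: regex must not contain capturing groups, since they change
    # what both re.split and re.findall return.
    # Pieces between the matches (the matched text is dropped here).
    outer = re.split(regex, content)
    # The matched delimiters, padded so the last piece also gets a partner.
    inner = re.findall(regex, content) + ['']
    # Interleave: piece, delimiter, piece, delimiter, ...
    return [val for pair in zip(outer, inner) for val in pair]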

Which results in:

>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit(r'\s', content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']

Or you can concatenate each piece with the delimiter that followed it:

def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    # Glue each piece back onto the delimiter that followed it.
    return [piece + delim for piece, delim in zip(outer, inner)]

Which results in:

>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
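One caveat: since Python 3.7, re.split also splits on empty matches, so a pattern with an alternative such as `$` can yield extra empty strings in the results above. A possible way around this is to walk the matches once with re.finditer and skip the empty ones. This is only a sketch of that variant, and the function name split_keep_delims is hypothetical:

```python
import re

def split_keep_delims(regex, content):
    """Split content on regex, keeping each matched delimiter
    attached to the end of the piece that precedes it."""
    pieces = []
    start = 0
    for m in re.finditer(regex, content):
        if m.end() == m.start():
            continue  # ignore empty matches, e.g. the one produced by `$`
        pieces.append(content[start:m.end()])
        start = m.end()
    if start < len(content):
        pieces.append(content[start:])  # trailing text after the last match
    return pieces

content = "Is this the number 3? The text goes on..."
print(split_keep_delims(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content))
# ['Is this the number 3? ', 'The text goes on...']
```

Because it uses match positions rather than re.findall, this variant also keeps working when the pattern contains capturing groups.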

Thanks! The uppercase solution is not really usable as there's a statistically high chance of capital words (names, towns, be it whatever), with low chance of it being combined with a number. Reconstructing the list is what I was looking for. — Andris Leduskrasts, May 18 '15 at 14:50
@andrisleduskrasts: well it was only an example of course. But as you have probably noted during the discussion, a lookaround regex with no character will not work. So to make the problem generic enough, one needs to reconstruct the consumed substrings oneself. — Willem Van Onsem, May 18 '15 at 14:54
