
I am splitting a string without removing the delimiters, by wrapping the entire regex in a capture group. The intent is to match sentences ending in one or more '[!?]' characters.

All is great, except I now get unwanted empty capture groups. How do I suppress those in the least hackish and most regexish way?

>>> re.compile(r'([^!?]*[!?]+)').split('Great customer service!  Very happy! Will go again')
['', 'Great customer service!', '', '  Very happy!', ' Will go again']

>>> re.compile(r'([^!?]{2,}[!?]+)').split('Great customer service!  Very happy! Will go again')
['', 'Great customer service!', '', '  Very happy!', ' Will go again']

(This is all deeply nested inside more complex regexes and subfunctions, so I really don't want hacks. I want the solution to be regexish so I can fold it into a more complex regex.)

smci
  • In cases like these it's usually better to use match instead of split (or whatever the similar functions are called in Python). – Qtax Jun 19 '13 at 00:15
  • @Qtax: no, I need to keep non-matching expressions as well. I genuinely want to do a split, not just a match. – smci Jun 19 '13 at 00:18
  • Those empty strings aren't from capture groups, they're the results of the split. By discarding the natural results of the split while keeping the contents of the capture groups, you are getting almost exactly the same result you would have gotten from `findall()`. Do you *really* have to use `split()`? – Alan Moore Jun 19 '13 at 01:20
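
To make the point in the last comment concrete, here is a minimal sketch comparing `split()` and `findall()` on the question's own input. The variable names and the final alternative pattern are illustrative assumptions, not something proposed in the thread.

```python
import re

text = 'Great customer service!  Very happy! Will go again'
pattern = re.compile(r'([^!?]*[!?]+)')

# split() returns the text *between* matches (empty here, because the
# matches are adjacent) interleaved with the captured group.
split_parts = pattern.split(text)
print(split_parts)
# ['', 'Great customer service!', '', '  Very happy!', ' Will go again']

# Filtering out the empty strings keeps the captures plus the unmatched tail.
kept = [part for part in split_parts if part]
print(kept)
# ['Great customer service!', '  Very happy!', ' Will go again']

# findall() returns only the captures, so the unmatched tail is lost.
found = pattern.findall(text)
print(found)
# ['Great customer service!', '  Very happy!']

# Assumption: making the terminator optional lets findall() keep the tail too.
# Caveat: a run of '!' or '?' at the very start of the text would be skipped.
alt = re.findall(r'[^!?]+[!?]*', text)
print(alt)
# ['Great customer service!', '  Very happy!', ' Will go again']
```

The only difference between the filtered `split()` result and plain `findall()` is the trailing piece that has no `!` or `?` terminator.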

1 Answer

This regex seems to work:

r'(?<=[!?])\s+(?=\S)'

What I'm trying to do is match the whitespace between sentences, but only if the preceding sentence ends with ? or !. This is slightly less hackish than your approach, but it's probably the best you're going to do. Manipulating natural language with regexes is hackish by definition. :D
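
As a quick check, here is a minimal sketch of that regex used with `re.split()` on the input from the question (the variable names are illustrative). Because the pattern has no capture group and matches only the whitespace, the result has no empty strings, but the whitespace between sentences is discarded.

```python
import re

text = 'Great customer service!  Very happy! Will go again'

# Match whitespace that follows '!' or '?' and precedes a non-space character.
# The lookarounds consume nothing, so the punctuation stays with its sentence.
boundary = re.compile(r'(?<=[!?])\s+(?=\S)')

sentences = boundary.split(text)
print(sentences)
# ['Great customer service!', 'Very happy!', 'Will go again']
```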

Alan Moore
  • Neat, but that matches the sentence **after** a sentence ending in one or more '[!?]' characters... I'll try to post better example text later. – smci Jun 19 '13 at 01:15
  • Any sentence that doesn't end with `?` or `!` will be treated as part of the next sentence that does. Is that what you're talking about? Your regex does the same thing. I'm just trying to correct your regex; your solution has other problems, too. – Alan Moore Jun 19 '13 at 01:25
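
The merging behaviour described in the last comment can be seen with a small sketch. The sample text below is made up for illustration; both regexes are the ones from this thread.

```python
import re

# Hypothetical input: the first sentence ends with '.', not '!' or '?'.
text = 'Good food. Great service! Will go again'

# Regex from the answer: the '.' is not a boundary, so the first two
# sentences stay together.
lookaround = re.compile(r'(?<=[!?])\s+(?=\S)')
by_lookaround = lookaround.split(text)
print(by_lookaround)
# ['Good food. Great service!', 'Will go again']

# Regex from the question: [^!?]* also runs straight through the '.'.
capturing = re.compile(r'([^!?]*[!?]+)')
by_capture = capturing.split(text)
print(by_capture)
# ['', 'Good food. Great service!', ' Will go again']
```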