How do I delimit my input by this capture group?

Question

For this regular expression:

(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]

I want the input string to be split by the captured matching \s character - the green matches as seen over here.

However, when I run this:

import re

p = re.compile(ur'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')

test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"

re.split(p, test_str)

It seems to split the string at the regions given by [.?!]+ and [A-Z0-9] (thus incorrectly omitting them) and leaves \s in the results.

To clarify:

Input: he paid a lot for it. Did he mind

Received Output: ['he paid a lot for it','\s','id he mind']

Expected Output: ['he paid a lot for it.','Did he mind']

You need to remove the capturing group. Use [`'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])'`](https://regex101.com/r/vA8iL6/2). — Wiktor Stribiżew, Nov 22 '15 at 18:29

Wiktor Stribiżew · Accepted Answer · 2015-11-22T19:12:42.280

You need to remove the capturing group from around (\s) and put the last character class into a look-ahead to exclude it from the match:

import re
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')  # r'' instead of ur'', which Python 3 rejects
#                                         ^^^^^        ^
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))

See IDEONE demo and the regex demo.

Any capturing group in the pattern makes re.split include the text captured by that group as an additional element in the resulting list.
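To illustrate that rule, here is a minimal sketch on the short input from the question. The lookbehind for titles is left out to keep the patterns short, and the names `text`, `with_group`, and `without_group` are only illustrative:

```python
import re

text = "he paid a lot for it. Did he mind"

# With a capturing group, re.split also returns the captured whitespace,
# and everything the pattern matched (".", the space, and "D") is consumed.
with_group = re.split(r'[.?!]+(\s)[A-Z0-9]', text)
print(with_group)     # ['he paid a lot for it', ' ', 'id he mind']

# Without the group, and with the trailing class moved into a look-ahead,
# only the punctuation and the whitespace are consumed.
without_group = re.split(r'[.?!]+\s(?=[A-Z0-9])', text)
print(without_group)  # ['he paid a lot for it', 'Did he mind']
```

Note that the second call still drops the punctuation, since `[.?!]+` is part of the delimiter.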

To force the punctuation to appear inside the "sentences", you can use this matching regex with re.findall:

import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))

See IDEONE demo

Results:

['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']

The regex follows the rules in your original pattern:

\s* - matches 0 or more whitespace characters, which are left out of the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - a capturing group with 2 alternatives; re.findall returns whatever it captures:
- (?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!] - 0 or more sequences of...
  - (?:Mr|Dr|Ms|Jr|Sr)\. - an abbreviated title followed by a dot
  - \.(?!\s+[A-Z0-9]) - a dot that is not followed by 1 or more whitespace characters and then an uppercase letter or a digit
  - [^.!?] - any character other than ., !, or ?
  ...followed by a closing ., ?, or !
or...
- [^.!?]+ - 1 or more characters other than ., !, or ?
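The same breakdown can be written into the pattern itself with re.VERBOSE. This is only a sketch of the regex above with comments added; the names `sentence_re` and `sample` and the sample text are made up for illustration:

```python
import re

sentence_re = re.compile(r'''
    \s*                          # leading whitespace, left out of the capture
    (                            # group 1: what re.findall returns
        (?:
            (?:Mr|Dr|Ms|Jr|Sr)\.     # an abbreviated title with its dot
          | \.(?!\s+[A-Z0-9])        # a dot that does not end a sentence
          | [^.!?]                   # any character other than . ! ?
        )*
        [.?!]                    # the punctuation that closes the sentence
      |
        [^.!?]+                  # or trailing text with no closing punctuation
    )
''', re.VERBOSE)

sample = "Dr. Who paid 1.5 dollars. Did he mind? No"
sentences = sentence_re.findall(sample)
print(sentences)  # ['Dr. Who paid 1.5 dollars.', 'Did he mind?', 'No']
```

In verbose mode, whitespace outside character classes is ignored, so the layout does not change what the pattern matches.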

How would you modify to not exclude the `[.?!]+` in the result like in the example I've shown? — Louis93, Nov 22 '15 at 18:34
This is possible to some extent, but I'd rather you use `re.findall` then. Let me prep a demo. — Wiktor Stribiżew, Nov 22 '15 at 18:36
Sounds good - I'm guessing the limitation comes from the fixed-width lookbehind args? — Louis93, Nov 22 '15 at 18:41
