Trying to split a sentence with a regular expression

Question

So far, the regex found here works nicely in almost every context I'm working with.

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s

In fact, it is even able to properly split sentences like this one:

Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well.

into:

Mr. Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.

Unfortunately, it doesn't cover one case. If I have, for example, a sentence like this:

C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well.

this regex splits it into three sub-sentences:

C.
Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.

Instead of:

C. Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.

What is missing is this specific case: when the whitespace is preceded by a single uppercase character followed by a dot (.), we should not split.

I still don't know how to use regex properly, so if you could also explain why your answer works, it would be much appreciated.

score 0 · Answer 1 · answered Feb 16 '21 at 11:39


You could extend the pattern by adding a negative lookbehind (?<!\b[A-Z]\.) to assert that there is not a single uppercase char followed by a . directly to the left.

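I think you can also omit the trailing dot in (?<!\w\.\w.); that dot matches any character except a newline. Note that without it the lookbehind can no longer match right before the whitespace, where the preceding character is always . or ?, so abbreviations such as e.g. are then split; keep the dot if you need those protected.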

(?<!\b[A-Z]\.)(?<!\w\.\w)(?<![A-Z][a-z]\.)(?<=[.?])\s

See a regex demo
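To illustrate why the extended pattern works, here is a minimal sketch in Python. Python is an assumption (the question does not name a language), and the helper name split_sentences is hypothetical. All lookbehinds are evaluated at the same position, just before the whitespace, and the split only happens if every one of them passes.

```python
import re

# Pattern copied verbatim from the answer above. Read left to right:
#   (?<!\b[A-Z]\.)      not preceded by a standalone uppercase letter and a dot ("C.")
#   (?<!\w\.\w)         not preceded by word char, dot, word char
#   (?<![A-Z][a-z]\.)   not preceded by a title-like abbreviation ("Mr.")
#   (?<=[.?])           must be preceded by a dot or a question mark
#   \s                  the whitespace that is consumed by the split
PATTERN = re.compile(r"(?<!\b[A-Z]\.)(?<!\w\.\w)(?<![A-Z][a-z]\.)(?<=[.?])\s")


def split_sentences(text):
    """Split text at every whitespace the pattern accepts as a sentence boundary."""
    return PATTERN.split(text)


initial = "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
title = "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."

for sentence in split_sentences(initial) + split_sentences(title):
    print(sentence)
```

In both inputs the only whitespace that survives all four lookbehinds is the one after "Nevada.", so each input yields two sentences.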


The fourth bird


Seems to be working! Could you please tell me why you put a \b in as well? – costabrava Feb 16 '21 at 11:45
@costabrava With the \b, a sentence can still be split when it ends with 2 uppercase chars, because only a standalone uppercase char before the dot blocks the split. https://regex101.com/r/hmu10Q/1 – The fourth bird Feb 16 '21 at 11:48
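The effect of the \b can be sketched in Python by comparing the lookbehind with and without it. The reduced patterns and the sample text below are assumptions made for illustration; they are not taken from the linked demo.

```python
import re

# Reduced to the one lookbehind under discussion plus the split condition.
WITH_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.?])\s")
# Hypothetical variant without the word boundary, for comparison only.
WITHOUT_BOUNDARY = re.compile(r"(?<![A-Z]\.)(?<=[.?])\s")

text = "He works at NASA. C. Daniel does too."

# With \b, "A." inside "NASA." is not a standalone letter (no word boundary
# between "S" and "A"), so the sentence ends there; "C." still blocks a split.
with_boundary = WITH_BOUNDARY.split(text)

# Without \b, any uppercase letter before the dot blocks the split,
# so the sentence ending in "NASA." is never separated.
without_boundary = WITHOUT_BOUNDARY.split(text)

print(with_boundary)
print(without_boundary)
```

The first call returns two sentences, the second returns the text unsplit, which is the case the comment above describes.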

score 0 · Answer 2 · answered Feb 16 '21 at 11:42

If you want a non-regex based solution, you can use nltk here.

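import nltk
# sent_tokenize needs the Punkt tokenizer data; if it is missing, download it once,
# e.g. nltk.download('punkt') (recent NLTK releases ask for 'punkt_tab' instead).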

txt_1 = "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."

nltk.sent_tokenize(txt_1)

['C. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']

txt_2 = "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
nltk.sent_tokenize(txt_2)

['Mr. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']
