-2

So far, this regular expression (found here) works nicely in almost every context I'm working with.

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s

In fact, it's able to properly split even sentences like this one:

Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well.

into:

Mr. Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.

Unfortunately, it doesn't cover one case. If I have, for example, a sentence like this:

C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well.

this regular expression splits it into three sub-sentences:

C.
Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.

Instead of:

C. Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.
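
A minimal way to reproduce the behaviour, assuming Python's re module (the question does not name a language, so this is only an assumption; the variable names are illustrative):

```python
import re

# Pattern from the question, unchanged.
PATTERN = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s")

ok = "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
bad = "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."

# "Mr." is protected by (?<![A-Z][a-z]\.), so only one split happens.
ok_parts = PATTERN.split(ok)

# Nothing protects "C.", so the whitespace after it is also a split point.
bad_parts = PATTERN.split(bad)

print(len(ok_parts), ok_parts)
print(len(bad_parts), bad_parts)
```

None of the lookbehinds describes a lone capital letter followed by a dot, which is why the whitespace after "C." is treated as a sentence boundary.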

What's missing is this specific case: when the whitespace is preceded by a single uppercase character followed by a dot (.), we should not split.

I still don't know how to use regex properly, so if you could also explain why your answer works, it would be much appreciated.

costabrava

2 Answers

0

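You could extend the pattern by adding a negative lookbehind, (?<!\b[A-Z]\.), which asserts that what is directly to the left is not a single uppercase char followed by a dot. The word boundary \b makes sure the uppercase char stands alone, so a word that merely ends in a capital letter (such as "IBM.") can still end a sentence.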

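Keep the unescaped dot after \w\.\w, though. It matches any character except a newline, and here it stands for the . or ? that (?<=[.?]) requires right before the whitespace. Without it, (?<!\w\.\w) would need a word character in that position, so it could never match and abbreviations such as e.g. would be split again.

(?<!\b[A-Z]\.)(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s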

See a regex demo
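
As a rough sketch of how this could be used from Python (an assumption, since the question does not say which regex engine is in use), the snippet below keeps the question's three lookarounds exactly as posted and prepends the new one. The names SENTENCE_BREAK and split_sentences are made up for illustration.

```python
import re

SENTENCE_BREAK = re.compile(
    r"(?<!\b[A-Z]\.)"      # not right after a lone capital letter + dot ("C.")
    r"(?<!\w\.\w.)"        # not right after abbreviations shaped like "e.g."
    r"(?<![A-Z][a-z]\.)"   # not right after titles such as "Mr." or "Dr."
    r"(?<=[.?])"           # only right after a "." or "?"
    r"\s"                  # the whitespace consumed as the separator
)


def split_sentences(text):
    """Split text on whitespace that passes every lookbehind above."""
    return SENTENCE_BREAK.split(text)


initial = split_sentences(
    "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
)
title = split_sentences(
    "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
)
acronym = split_sentences("He works at IBM. She does too.")

print(initial)
print(title)
print(acronym)
```

Each lookbehind is checked at the same position, just before the whitespace, and all of them have to pass for a split to happen. One trade-off of the new lookbehind: a sentence that really ends in a lone capital letter ("We chose plan B. It worked.") will no longer be split at that point.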

The fourth bird
0

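If you want a solution that is not based on regex, you can use nltk here. Note that sent_tokenize relies on NLTK's Punkt tokenizer data, which may need to be downloaded once with nltk.download before the first call.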

import nltk

txt_1 = "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."

nltk.sent_tokenize(txt_1)

['C. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']

txt_2 = "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
nltk.sent_tokenize(txt_2)

['Mr. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']
Sreeram TP