2

I am trying to split this text

"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot " \
"for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this " \
"isn't true... Well, with a probability of .9 it isn't."

into the list of sentences below:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Code:

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]',text)

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', "Adam Jones Jr. thinks he didn't. "]

This is good, but it misses some sentences. Since the trailing [^a-z] isn't part of my group, is there a way to tell Python to continue searching from that character?

EDIT:

This was achieved with a positive lookahead, as suggested by @sputnick.

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])',text)

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', 'Did he mind? ', "Adam Jones Jr. thinks he didn't. "
, "In any case, this isn't true... "]

But we still need the last sentence. Any ideas?

garg10may
  • related: [Python - RegEx for splitting text into sentences (sentence-tokenizing)](http://stackoverflow.com/q/25735644/4279). – jfs Dec 26 '14 at 06:03

3 Answers

2

Try this:

print re.findall('([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])',text)

This uses a positive lookahead; see http://www.regular-expressions.info/lookaround.html
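
To see why the lookahead helps, here is a minimal sketch, assuming Python 3 (the thread's snippets use the Python 2 print statement) and the sample text from the question. The names `TEXT`, `CONSUMING` and `LOOKAHEAD` are only illustrative. In the first pattern the trailing `[^a-z]` consumes the capital letter that starts the next sentence, so that sentence can no longer match; `(?=[^a-z])` only checks the character and leaves it in place.

```python
import re

# Sample text copied from the question.
TEXT = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# Trailing [^a-z] is part of the match, so it eats the first letter of the next sentence.
CONSUMING = r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]'
# (?=[^a-z]) asserts the same condition without consuming anything.
LOOKAHEAD = r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])'

consumed = re.findall(CONSUMING, TEXT)
peeked = re.findall(LOOKAHEAD, TEXT)

print(len(consumed), len(peeked))  # 2 4
for sentence in peeked:
    print(repr(sentence))
```

Both patterns still miss the last sentence, because it ends at the end of the string and not with a dot followed by a space.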

Gilles Quénot
  • wow, regex are awesome, works perfect. Thx @sputnick. What is `?=` actually meant for? – garg10may Dec 25 '14 at 21:36
  • This is the syntax for _positive look-ahead_, check the added link in my answer – Gilles Quénot Dec 25 '14 at 21:39
  • Nice tutorial at the link. Is there also a way to include the last sentence, i.e. to skip looking for a space and [^a-z] after the dot when it is the end of the text? Something like word boundaries. – garg10may Dec 26 '14 at 01:16
1
(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))

Try this. See the demo. Grab the captures or groups, using re.findall.

https://regex101.com/r/gQ3kS4/45
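
A hedged sketch of how this pattern might be used with re.findall, assuming Python 3 and the question's sample text (`TEXT`, `PATTERN` and `sentences` are illustrative names). Each match after the first starts at the space left over from the previous sentence, so the sketch strips the results. Note that the pattern only treats `.` and `?` as terminators, and it skips a dot that follows a capital plus one lowercase letter (Mr., Jr.) or a letter-dot-letter sequence (i.e.); other abbreviations are not covered.

```python
import re

# Sample text copied from the question.
TEXT = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# Lazy capture up to a '.' or '?' that is followed by whitespace or end of string,
# unless the terminator comes right after "Xx" (Mr, Jr) or "x.x" (i.e).
PATTERN = r'(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))'

# findall returns the contents of the single capture group for each match.
sentences = [s.strip() for s in re.findall(PATTERN, TEXT)]

for sentence in sentences:
    print(sentence)
```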

vks
0

Finally

 print re.findall('[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$',text)

The above works as needed and includes the last sentence, but I don't understand why |.*.$ works. Please help me understand.

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', 'Did he mind? ', "Adam Jones Jr. thinks he didn't. "
, "In any case, this isn't true... ", "Well, with a probability of .9 
it isn't."] 
garg10may
  • There is no space at the end: `re.findall('[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)', text)` – jfs Dec 26 '14 at 05:53
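
On why `|.*.$` works: `|` is alternation, and at each position the engine tries the left branch first. For the first four sentences the left branch succeeds. At `Well, ...` it fails because no space follows the final dot, so the right branch `.*.$` runs and matches everything from there to the end of the string. A minimal sketch, assuming Python 3 and the question's text (`FALLBACK`, `STRICT` and `junk` are illustrative names), compares it with the pattern from the comment above. The last lines show that the `.*.$` branch accepts any remainder, while the stricter pattern does not.

```python
import re

# Sample text copied from the question.
TEXT = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# Left branch: a sentence followed by a space; right branch: "anything up to the end".
FALLBACK = r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$'
# Pattern from the comment: a sentence followed by a space OR by the end of the string.
STRICT = r'[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)'

with_fallback = re.findall(FALLBACK, TEXT)
with_strict = re.findall(STRICT, TEXT)

print(with_fallback == with_strict)  # True
print(with_fallback[-1])             # Well, with a probability of .9 it isn't.

# Text with no sentence at all: the fallback branch still matches the whole string.
junk = "no sentence here"
print(re.findall(FALLBACK, junk))    # ['no sentence here']
print(re.findall(STRICT, junk))      # []
```

Since `.` does not match a newline by default, both patterns as written assume the text is a single line.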