Python regex, conditional searching

Question

I am trying to split this text:

"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot " \
"for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this " \
"isn't true... Well, with a probability of .9 it isn't."

into the following list of sentences:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Code:

print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]', text))

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 "Adam Jones Jr. thinks he didn't. "]

That is a good start, but it misses some sentences. Since the trailing [^a-z] isn't part of my group, is there a way to tell Python to continue searching from there?
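The sentences go missing because the trailing [^a-z] is still part of the overall match, even though it sits outside the group: it consumes the first letter of the next sentence, and re.findall resumes scanning after it. A small sketch that makes this visible with re.finditer (the names `pattern` and `consumed` are only for illustration):

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

pattern = re.compile(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]')

# group(0) is the whole match; its last character is the one
# swallowed by the trailing [^a-z].
consumed = [m.group(0)[-1] for m in pattern.finditer(text)]
print(consumed)  # ['D', 'I']
```

The 'D' of "Did" and the 'I' of "In" are eaten, so the next attempt can no longer start those sentences with [A-Z]+.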

EDIT:

This was achieved with a positive lookahead, as suggested by @sputnick.

print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... "]

But we still need the last sentence. Any ideas?

related: [Python - RegEx for splitting text into sentences (sentence-tokenizing)](http://stackoverflow.com/q/25735644/4279). — jfs, Dec 26 '14 at 06:03

Gilles Quénot · Accepted Answer · 2014-12-25T21:38:37.777

2

Try this:

print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))

This uses a positive lookahead: (?=...) checks what follows without consuming it. See http://www.regular-expressions.info/lookaround.html
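To see what the lookahead changes, compare a consuming pattern with its lookahead counterpart. This is a minimal sketch on a made-up string (the sample text and variable names are illustrative, not taken from the question):

```python
import re

sample = "One. Two. Three."

# Without lookahead, the uppercase letter after the space is consumed,
# so the next search starts one character too late and "Two. " is lost.
consuming = re.findall(r'([A-Z][a-z]+\. )[A-Z]', sample)

# With (?=...), the next character is only checked, not consumed,
# so the following match can begin at that very character.
lookahead = re.findall(r'([A-Z][a-z]+\. )(?=[A-Z])', sample)

print(consuming)  # ['One. ']
print(lookahead)  # ['One. ', 'Two. ']
```

Neither variant returns "Three." because nothing follows its final dot, which is the same reason the last sentence is still missing in the question.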

edited Dec 25 '14 at 21:38

answered Dec 25 '14 at 21:31


wow, regex are awesome, works perfect. Thx @sputnick. What is `?=` actually meant for? – garg10may Dec 25 '14 at 21:36
This is the syntax for _positive look-ahead_, check the added link in my answer – Gilles Quénot Dec 25 '14 at 21:39
Nice tutorial at the link. Could there also be a way to include the last sentence, i.e. to skip the check for a space and [^a-z] after the dot if it is the end of the file? Something like word boundaries. – garg10may Dec 26 '14 at 01:16

score 1 · Answer 2 · answered Dec 26 '14 at 05:27


(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))

Try this. See the demo. Use re.findall and grab the captured groups.

https://regex101.com/r/gQ3kS4/45
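A sketch of how this pattern might be used with re.findall on the text from the question. The strip() call is an addition, not part of the answer: every captured group after the first starts with the space that separated it from the previous sentence.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# Pattern copied from the answer above, written as a raw string.
# The lookbehind accepts a '.' or '?' only if it is not preceded by
# something like "Mr" or "i.e", and only if whitespace or the end follows.
pattern = r'(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))'

# strip() removes the separating space captured at the start of each group.
sentences = [s.strip() for s in re.findall(pattern, text)]

for s in sentences:
    print(s)
```

Note that the pattern only treats "." and "?" as sentence endings, which is enough for this sample text.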


vks


score 0 · Answer 3 · answered Dec 26 '14 at 02:18


Finally

print(re.findall(r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$', text))

The pattern above works as needed and includes the last sentence. But I have no idea why |.*.$ worked; please help me understand.

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... ",
 "Well, with a probability of .9 it isn't."]
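One way to read the extra alternative: at each position the engine tries the left branch first. For the last sentence that branch fails, because no space follows the final dot, so the right branch .*.$ takes over and matches everything from the current position to the end of the string. The second . is unescaped, so it stands for any character, not a literal dot. A sketch on the last sentence alone (the variable names are illustrative):

```python
import re

tail = "Well, with a probability of .9 it isn't."
left = r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])'

# The left branch alone needs a space after the final punctuation,
# so it finds nothing in the last sentence.
left_only = re.findall(left, tail)

# With the fallback branch, the rest of the string is matched as is.
with_fallback = re.findall(left + r'|.*.$', tail)

# The fallback does not check for sentence punctuation at all.
no_dot = re.findall(r'.*.$', "no final dot here")

print(left_only)      # []
print(with_fallback)  # ["Well, with a probability of .9 it isn't."]
print(no_dot)         # ['no final dot here']
```

So the fallback works for this text, but it would also accept trailing text that is not a complete sentence.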


garg10may



There is no space at the end: `re.findall('[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)', text)` – jfs Dec 26 '14 at 05:53
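Applied to the text from the question, the pattern from this comment can be sketched as follows. It keeps the sentence-ending check for the last sentence and only makes the "space plus lookahead" part optional at the end of the string:

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# After the final punctuation, accept either a space followed by a
# non-lowercase character, or the end of the string.
pattern = r'[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)'

result = re.findall(pattern, text)
print(result)
```

The first four items keep their trailing space and the last one has none; calling strip() on each item would normalize them.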
