
I have the following regex, which is supposed to find sequences of words that end with a punctuation mark. The lookahead ensures that the match is followed by a whitespace character and then a capital letter or digit.

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")

What is the function of the following lookahead?

pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

Does Python 3.2 support variable-length lookahead (\s+)? I do not get any error. Furthermore, I cannot see any difference between the two patterns: both seem to work the same regardless of the number of blanks I have. Is there an explanation for the purpose of the \s+ in the lookahead?

Matt Fenwick
andreSmol

2 Answers


The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter, while the second one expects at least one whitespace character and matches as many as are present.

The + is called a quantifier. It means one or more repetitions, as many as possible.

To recap

\s (Exactly one whitespace character. The lookahead fails if it is missing or if more than one whitespace character comes before the capital letter or digit.)
\s+ (At least one whitespace character, possibly more.)
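
A minimal sketch of this recap, using made-up sample strings (`one_blank` and `three_blanks` are illustrative names, not taken from the question). It keeps only the punctuation class and the lookahead from each pattern, so the effect of \s versus \s+ can be seen in isolation:

```python
import re

# Only the punctuation class plus the lookahead, without the leading \w.+?
single = re.compile(r"[?.!](?=\s[A-Z\d])")   # exactly one whitespace character
multi = re.compile(r"[?.!](?=\s+[A-Z\d])")   # one or more whitespace characters

# Hypothetical sample strings for illustration.
one_blank = "Done. Next"
three_blanks = "Done.   Next"

single_on_one = single.search(one_blank)
single_on_three = single.search(three_blanks)
multi_on_one = multi.search(one_blank)
multi_on_three = multi.search(three_blanks)

print(single_on_one is not None)    # True
print(single_on_three is not None)  # False: a second blank sits where [A-Z\d] is required
print(multi_on_one is not None)     # True
print(multi_on_three is not None)   # True

# The lookahead does not consume anything: the match is only the punctuation.
print(repr(multi_on_three.group()))  # '.'
```

Note that in both cases the whitespace itself is never part of the match, because a lookahead only checks what follows.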

Further studying.

I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter

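To answer this comment, please consider:

What does \w.+? actually match?

A single word character (for ASCII text, [a-zA-Z0-9_]) followed by at least one "any" character (except newline), but with the lazy quantifier +?. A lazy quantifier takes as few characters as possible, yet it keeps extending, blanks included, whenever the rest of the pattern cannot match at the current position. In your case, with only \s in the lookahead, a punctuation mark followed by several blanks fails the lookahead, so .+? carries on and consumes those blanks until a later position satisfies the pattern. This is why you see them in your output.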
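
A small sketch of that behaviour with the two patterns from the question. The sample `text` is hypothetical, and it is not claimed to reproduce the exact output quoted in the comment; it only shows how the lazy `.+?` ends up swallowing blanks when the single `\s` lookahead fails:

```python
import re

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

# Hypothetical sample: three blanks after the first sentence.
text = "The car is parked.   In the garage. Next one."

result1 = pat1.findall(text)
result2 = pat2.findall(text)

# With \s, the first "." is rejected (it is followed by more than one blank),
# so .+? runs on through the blanks up to the next "." that qualifies.
print(result1)  # ['The car is parked.   In the garage.']

# With \s+, the first "." is accepted, so each sentence is matched separately.
print(result2)  # ['The car is parked.', 'In the garage.']
```

The last sentence is not returned by either pattern, because nothing follows its final period and the lookahead therefore fails.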

FailedDev
  • Thanks FailedDev. When I run the regex with only \s and I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter. In my result I get text with blanks like: "the car is parked----", "in the garage" (-- symbolizes blanks). If I have the \s+ in the lookahead, the extra blanks are not captured and I get "the car is parked", "in the garage", regardless of how many blanks I have between the words. Is it correct that Python 3 supports variable-length lookahead? – andreSmol Nov 09 '11 at 18:12

I'm not really sure what you are trying to achieve here.

A sequence of words ending in a punctuation mark can be matched with something like:

re.findall(r'([\w\s]*[\?\!\.;])', s)

Is the lookahead there because another string is required to follow?

In any case:

  • \s requires one and only one whitespace character;
  • \s+ requires at least one whitespace character.

And yes, the lookahead accepts the "+" quantifier, even in Python 2.x.
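
As a quick check of this point, here is a minimal sketch for Python 3 (the sample string "Stop.    Go" is made up). It also contrasts the lookahead with a variable-length lookbehind, which the standard `re` module rejects at compile time; that contrast is an addition for illustration and may explain why an error was expected:

```python
import re

# A variable-length lookahead compiles and matches without complaint.
lookahead = re.compile(r"[?.!](?=\s+[A-Z\d])")
lookahead_match = lookahead.search("Stop.    Go")

# A variable-length lookbehind, by contrast, is refused by the re module.
try:
    re.compile(r"(?<=\s+)[A-Z]")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = exc

print(lookahead_match.group())
print(lookbehind_error)
```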

The same findall pattern as before, but with a lookahead:

re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)

or

re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)

You can try them all on something like:

s='Stefano ciao.   a domani. a presto;'

Depending on your strings, the lookahead may or may not be necessary, and using "+" to allow more than one space may or may not change the result.
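
For reference, a short sketch that runs the three patterns above on the sample string. The variable names (`no_lookahead`, `one_space`, `many_spaces`) are only illustrative:

```python
import re

s = 'Stefano ciao.   a domani. a presto;'

# No lookahead: every run of words/whitespace ending in punctuation.
no_lookahead = re.findall(r'([\w\s]*[\?\!\.;])', s)

# Lookahead with exactly one whitespace character before the next word.
one_space = re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)

# Lookahead with one or more whitespace characters before the next word.
many_spaces = re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)

print(no_lookahead)  # ['Stefano ciao.', '   a domani.', ' a presto;']
print(one_space)     # ['   a domani.']
print(many_spaces)   # ['Stefano ciao.', '   a domani.']
```

With this string, the single \s rejects 'Stefano ciao.' because three blanks follow it, while \s+ accepts it. The final ' a presto;' only appears without a lookahead, since nothing follows it.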

Stefano