2

string: XXaaaXXbbbXXcccXXdddOO

I want to match the shortest string that begins with 'XX' and ends with 'OO'.

So I wrote the non-greedy regex: r'XX.*?OO'

>>> import re
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']

I expected it to return ['XXdddOO'], but the match was so 'greedy'.

I realize I must be mistaken: the pattern above first matches the leftmost 'XX' and only then applies its non-greedy behaviour.

But I still want to figure out how I can get my result ['XXdddOO'] directly. Any reply is appreciated.

So far, the key point is not really about non-greedy matching as such, or rather, it is about what non-greedy means to me: the match should contain as few characters as possible between the left delimiter (XX) and the right delimiter (OO). And of course, the fact is that the string is processed from left to right.

joooohnli
  • Perhaps `re.findall(r'XX[^X]*OO', str)`? – devnull Jan 25 '14 at 11:15
  • There's nothing about being greedy in your regex; the non-greedy quantifier would have had an effect if you had input like `XXaaaOOXXbbbOOXXcccOOXXdddOO`. – devnull Jan 25 '14 at 11:17
  • possible duplicate of [Difference between .\*? and .\* for regex](http://stackoverflow.com/a/3075532/7586) – Kobi Jan 25 '14 at 11:37
  • @devnull I know I am confused about some of the concepts, but I want to get the 'non-greedy' behaviour I have in mind. And if the input is `XXaaaXXbbbXXcccXXdXddOO`, then `re.findall(r'XX[^X]*OO', str)` will not work. – joooohnli Jan 25 '14 at 11:47
  • @joooohnli Yes, it's evident that you're getting confused. Please refer to the linked question to get a better handle on greedy vs. non-greedy regular expressions. – devnull Jan 25 '14 at 11:54

4 Answers

5

How about:

.*(XX.*?OO)

The match will be in group 1.
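As a rough illustration of why this works, here is a minimal sketch using Python's `re` module on the string from the question. The names `text` and `last_block` are mine, chosen only for this example.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'  # sample string from the question

# The leading greedy .* first consumes as much as it can, then backtracks
# only as far as needed for XX.*?OO to match. Group 1 therefore starts at
# the last 'XX' from which an 'OO' can still be reached.
m = re.search(r'.*(XX.*?OO)', text)
last_block = m.group(1) if m else None
print(last_block)  # XXdddOO
```

Note that this approach yields only one result per line: on an input such as `XXaOOXXbOO` the greedy prefix swallows the first block, so only `XXbOO` is captured.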

Toto
2

Regexes work from left to right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could do:

XX[a-z]{3}OO

to select all patterns like XXiiiOO. It can be adjusted to fit your needs: XX[^X]+?OO, for instance, selects everything from the last XX before an OO up to that OO, so in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO.
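A quick sketch of both patterns in Python, run against the strings mentioned in this thread. The variable names are only illustrative.

```python
import re

# Fixed-width variant: exactly three lowercase letters between XX and OO.
fixed = re.findall(r'XX[a-z]{3}OO', 'XXaaaXXbbbXXcccXXdddOO')
print(fixed)  # ['XXdddOO']

# Negated-class variant: anything except 'X' between XX and OO.
negated = re.findall(r'XX[^X]+?OO', 'XXiiiXXdddFFcccOOlll')
print(negated)  # ['XXdddFFcccOO']

# Limitation: a single 'X' between the delimiters prevents any match,
# as with the input from the asker's comment.
with_single_x = re.findall(r'XX[^X]+?OO', 'XXaaaXXbbbXXcccXXdXddOO')
print(with_single_x)  # []
```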

Robin
2

Indeed, the issue is not with greedy/non-greedy… The solution suggested by @devnull should work, provided you want to exclude even a single X between your XX and OO groups.

Otherwise, you’ll have to use a lookahead (i.e. a piece of regex that goes “scouting” ahead in the string and checks whether it can be satisfied, but without actually consuming any characters). Something like this:

re.findall(r'XX(?:.(?!XX))*?OO', str)

With this negative lookahead, you match (non-greedily) any character (.) that is not followed by XX.
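For reference, a small sketch of that call on both inputs discussed in the question and its comments. I use `text` rather than `str` here only to avoid shadowing the built-in; the other names are also mine.

```python
import re

pattern = r'XX(?:.(?!XX))*?OO'

# Original input from the question.
text = 'XXaaaXXbbbXXcccXXdddOO'
result = re.findall(pattern, text)
print(result)  # ['XXdddOO']

# Input from the asker's comment, with a lone 'X' between XX and OO.
text_single_x = 'XXaaaXXbbbXXcccXXdXddOO'
result_single_x = re.findall(pattern, text_single_x)
print(result_single_x)  # ['XXdXddOO']
```

Starting from any earlier XX, the repetition eventually reaches a character that is directly followed by another XX, so that attempt fails and the engine moves on to the next starting position.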

mont29
1

The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class:

XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
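A hedged usage sketch in Python: because the pattern contains a capturing group, `re.findall` would return the group's contents rather than the whole match, so this example (my own wrapper, not part of the pattern) reads `group(0)` from `re.finditer` instead.

```python
import re

pattern = re.compile(r'XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO')

def full_matches(text):
    """Return the whole-match text for every match of the pattern."""
    return [m.group(0) for m in pattern.finditer(text)]

matches = full_matches('XXaaaXXbbbXXcccXXdddOO')
print(matches)  # ['XXdddOO']

matches_single_x = full_matches('XXaaaXXbbbXXcccXXdXddOO')
print(matches_single_x)  # ['XXdXddOO']
```

The three alternatives allow a run of characters other than X and O, an O that is not followed by another O, or an X that is not followed by another X, so the body can never step over an XX or an OO.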
Casimir et Hippolyte