
My regex does not pick the closest 'cont' pair to the inner text. How can I fix that?

Input:

cont cont ItextI /cont /cont

Regex:

cont.*?I(.*?)I.*?/cont

Match:

cont cont ItextI /cont

Match I need:

cont ItextI /cont
Undo
snowindy

2 Answers

cont(?:(?!/?cont).)*I(.*?)I(?:(?!/?cont).)*/cont

will only match the innermost block.

Explanation:

cont        # match "cont"
(?:         # Match...
 (?!/?cont) # (as long as we're not at the start of "cont" or "/cont")
 .          # any character.
)*          # Repeat any number of times.
I           # Match "I"
(.*?)       # Match as few characters as possible, capturing them.
I           # Match "I"
(?:         # Same as above
 (?!/?cont)
 .
)*
/cont       # Match "/cont"

This explicitly forbids cont or /cont from appearing between the opening cont and the to-be-captured text (and between that text and the closing /cont).
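To check the pattern against the input from the question, here is a minimal sketch in Python's re module (Python is an assumption here; the question itself targets AS and Java, where the same syntax should apply). The names INNERMOST, sample, and inner are illustrative only.

```python
import re

# Pattern from the answer: "cont" and "/cont" are forbidden between the
# opening "cont" and the captured text, and between that text and "/cont".
INNERMOST = re.compile(r"cont(?:(?!/?cont).)*I(.*?)I(?:(?!/?cont).)*/cont")

sample = "cont cont ItextI /cont /cont"
inner = INNERMOST.search(sample)

print(inner.group(0))  # cont ItextI /cont
print(inner.group(1))  # text
```

The attempt starting at the first cont fails because the lookahead stops the scan right before the second cont, so the engine moves on and matches from the second one.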

Tim Pietzcker

You get cont cont ItextI /cont because the regex engine matches the leading cont of your pattern against the first "cont" in the input, then lets the reluctant .*? consume the whitespace, the second cont, and the whitespace before ItextI. When it reaches ItextI, it recognizes the I as the next part of the pattern and continues with the rest of the regex. As minitech writes, this happens because the engine works from the beginning of the string and reports the earliest possible match.
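The behaviour described above can be reproduced with a short sketch, assuming Python's re module (the names LAZY, sample, and outer are made up for the example):

```python
import re

# Original pattern from the question: every quantifier is reluctant,
# but the match still begins at the leftmost position that can succeed.
LAZY = re.compile(r"cont.*?I(.*?)I.*?/cont")

sample = "cont cont ItextI /cont /cont"
outer = LAZY.search(sample)

print(outer.start())   # 0, i.e. the first "cont"
print(outer.group(0))  # cont cont ItextI /cont
print(outer.group(1))  # text
```

The reluctant quantifiers only keep the match as short as possible once a start position has been fixed; they do not move the start position forward.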

If you can make assumptions about the whitespace, you can write:

cont\s+I(.*?)I\s+/cont

This matches the inner cont ItextI /cont in your example above.
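A quick way to see both what this pattern matches and where its whitespace assumption breaks, sketched in Python (an assumption; the second input string is invented for illustration):

```python
import re

# Whitespace-only variant: nothing but whitespace may sit between
# "cont" and the first I, or between the second I and "/cont".
WS_ONLY = re.compile(r"cont\s+I(.*?)I\s+/cont")

hit = WS_ONLY.search("cont cont ItextI /cont /cont")
print(hit.group(0))  # cont ItextI /cont
print(hit.group(1))  # text

# Hypothetical input with other characters around the captured text.
miss = WS_ONLY.search("cont x ItextI y /cont")
print(miss)  # None
```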

beerbajay
  • No, there can be anything, not only whitespace. After some research into the regex look-behind feature, I found this solution: (?<=cont).*?I(.*?)I.*?/cont (works well in AS and Java). – snowindy Feb 05 '12 at 16:34
  • Okay, it would be helpful if you provided a more complete example in the future, as the input text above is a bit misleading. – beerbajay Feb 05 '12 at 16:35