I have this sentence:

<i>foo 42 </i> <i>(bar)</i>

If I try to match it with this regex:

<i>(foo \d+\s*.+?)(\(bar\))

The group 1 of the result is:

foo 42 </i> <i>

However, (bar) may or may not be there, so I put a ? at the end of the regex like this:

<i>(foo \d+\s*.+?)(\(bar\))?

The group 1 of the result becomes:

foo 42 <

How can I get foo 42 </i> <i> in group 1 while keeping the ? quantifier on the (bar) group?

Thank you

Martin J
  • Are you trying to parse html with regex? You get that match because if you make `(\(bar\))?` optional this part `\s*.+?` matches a space and `<` due to the + and ?. Why not match the closing and the opening tag? – The fourth bird Aug 04 '19 at 09:54
  • Please describe in words what you want. Getting everything before `(bar)` when `(bar)` is not there makes no sense. – Oleg Aug 04 '19 at 09:57
  • I am trying to get everything before `(bar)` because there may be important info between `</i>` and `<i>` of `foo 42 </i> <i>` that I want to keep in my first group. – Martin J Aug 04 '19 at 10:05
  • You have a lazy quantifier on your `.+` so it will only take _as much as it needs to_. You then make `(bar)` optional. So why would the engine take any more? You've told it to be lazy and it is - you need to think more carefully about what you're trying to match. Obviously if you make your `.+` greedy it will grab absolutely everything in the case `(bar)` is missing. So the real question becomes, when `(bar)` isn't the terminator, what is? – Boris the Spider Aug 04 '19 at 10:06
  • okay so I understand when the terminator is optional the lazy quantifier will stop as early as possible. I thought it would be more "dynamic" and understand it can continue if the terminator is not here and stop at the terminator if it's here. – Martin J Aug 04 '19 at 10:46
  • What is the right hand boundary? ``? You must tell the regex where to stop matching, otherwise, it will match either too little or too much. – Wiktor Stribiżew Aug 04 '19 at 10:49
  • Either https://regex101.com/r/StM4v3/1 or https://regex101.com/r/StM4v3/2 is the answer. – Wiktor Stribiżew Aug 04 '19 at 16:14

1 Answer

The point is that an optional subpattern at the end of the pattern, after a lazy dot pattern, only matches if its match starts right after the minimum the lazy pattern has to consume: one char for .+?, zero chars for .*?.

That is, <i>(foo \d+\s*.+?)(\(bar\))? will grab (bar) if it follows 0 or more whitespaces and 1 char, like in <i>foo 42 <(bar)</i> or <i>foo 42<(bar)</i> (see demo).
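The difference is easy to reproduce. Here is a minimal sketch using Python's `re` module (the question does not name a regex flavor, so Python is an assumption here; the variable names are only illustrative):

```python
import re

text = "<i>foo 42 </i> <i>(bar)</i>"

# Mandatory (bar): the lazy .+? must keep expanding until (bar) can match.
required = re.search(r"<i>(foo \d+\s*.+?)(\(bar\))", text)

# Optional (bar): .+? takes a single char, (bar) is not next, so group 2
# is skipped and the match ends right there.
optional = re.search(r"<i>(foo \d+\s*.+?)(\(bar\))?", text)

# (bar) directly after that single char: the optional group does match.
adjacent = re.search(r"<i>(foo \d+\s*.+?)(\(bar\))?", "<i>foo 42<(bar)</i>")

print(repr(required.group(1)))   # 'foo 42 </i> <i>'
print(repr(optional.group(1)))   # 'foo 42 <'
print(optional.group(2))         # None
print(adjacent.group(2))         # (bar)
```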

Since you want to match up to any optional (bar), you need to turn the .+? into a tempered greedy token: a dot that can be used with a greedy quantifier, but is tempered, i.e. restricted with a negative lookahead:

<i>(foo \d+\s*(?:(?!\(bar\)).)*)(\(bar\))?

Or, if you need to match the closest foo <digits> to the (bar):

<i>(foo \d+\s*(?:(?!\(bar\)|foo \d).)*)(\(bar\))?

See Regex 1 and Regex 2 demos.

Details

  • <i> - literal string
  • (foo \d+\s*(?:(?!\(bar\)|foo \d).)*) - Group 1:
    • foo \d+ - foo, space and 1+ digits
    • \s* - 0+ whitespaces
    • (?:(?!\(bar\)|foo \d).)* - any char, 0 or more occurrences, as many as possible, that does not start a (bar) char sequence or a foo, space, digit char sequence
  • (\(bar\))? - an optional Group 2: (bar) substring.
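As a quick check of the first pattern, here is a small sketch in Python's `re` module (the flavor is an assumption, and `split_bar` plus the `baz` sample are hypothetical, made up for illustration):

```python
import re

# First pattern from this answer: a tempered greedy token before optional (bar).
TEMPERED = re.compile(r"<i>(foo \d+\s*(?:(?!\(bar\)).)*)(\(bar\))?")

def split_bar(text):
    """Return (group 1, group 2) of the first match, or None if nothing matches."""
    m = TEMPERED.search(text)
    return (m.group(1), m.group(2)) if m else None

with_bar = split_bar("<i>foo 42 </i> <i>(bar)</i>")
without_bar = split_bar("<i>foo 42 </i> <i>baz</i>")

print(with_bar)      # ('foo 42 </i> <i>', '(bar)')
print(without_bar)   # ('foo 42 </i> <i>baz</i>', None)
```

Note that when (bar) is absent, group 1 runs to the end of the line, since nothing else stops the tempered dot. If that is too much, the lookahead needs another alternative that describes the right-hand boundary.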
Wiktor Stribiżew