1

Suppose I have the following text:

Yes: [x]
Yes: [  x]
Yes: [x  ]
Yes: [  x  ]
No: [x
No: x]

I am interested in a regex expression with two capturing groups as follows:

  • group $1: match the brackets [ and ] that contain an x in between with a variable amount of horizontal space around x. I can use The fourth bird's answer based on a branch reset group for this:

    • regex: (?|(\[)(?=\h*x\h*])|(?<=\[)\h*x\h*(]))

    • which results in:

      First capturing group

  • group $2: should match the x that is contained in the brackets [ and ]. For this, I can use a non-capturing group with a combination of positive lookbehind and lookahead assertions:

    • regex: (?:(?<=\[)\h*(x)(?=\h*]))

    • which results in:

      Second capturing group

The problem is that when I try to join the expressions by OR, the second expression does not match anything. For example:

(?|(\[)(?=\h*x\h*])|(?<=\[)\h*x\h*(]))|(?:(?<=\[)\h*(x)(?=\h*]))

Results in (i.e., see demo with extended flag enabled for clarity):

Both capturing groups combined

My intuition (i.e., perhaps not correct) is that there is no x left to match for the second expression because the x is matched in the first expression (i.e., group $0). For example, simplifying the second expression to (?:(x)) (i.e., see demo) will match just fine the x not comprised in brackets as shown below.

Demonstration of toy example that works

Therefore, I thought I should somehow reset the group $0 matches from the first expression. So I tried adding the \K meta escape to the first expression before (]), but that did not solve anything.

Also, as much as possible, I want to stick to the format (?|regex)|(?:regex)|... because I want to be able to extend the expression further with other groups. I am using Oniguruma regular expressions and the PCRE flavor. Do you have any ideas on how this can be accomplished?

P.S. Apologies if the title of the question is not entirely accurate.

Mihai
  • 2,807
  • 4
  • 28
  • 53
  • `x` is already consumed with `\h*x\h*(])`. – Wiktor Stribiżew Dec 17 '21 at 10:30
  • @WiktorStribiżew, I suspected that this might be the problem. Is there a workaround to forget that `x` has already been consumed? I also tried placing `\h*x\h*` in a non-capturing group as well. – Mihai Dec 17 '21 at 10:35
  • 1
    See https://regex101.com/r/v6zH2m/6, put the second alternation in the branch reset group inside a lookahead to "unconsume" `x`. – Wiktor Stribiżew Dec 17 '21 at 10:39
  • @WiktorStribiżew, oh this is really clever! I suppose it works because, according to [regular-expressions.info](https://www.regular-expressions.info/lookaround.html), *"lookaround actually matches characters, but then gives up the match, returning only the result"*? This is really nice! – Mihai Dec 17 '21 at 10:45
  • 1
    You could also capture the x in a group in the first alternative `(?|(\[)(?=\h*(x)\h*])|(?<=\[)\h*x\h*(]))` See https://regex101.com/r/Es5na6/1 – The fourth bird Dec 17 '21 at 10:53
  • @Thefourthbird, indeed, and that would be a much simpler solution to comprehend. But I was curious how to *unconsume* matches part of the group `$0` because I need to extend this expression further. I have in total seven capturing groups and it helps me to think in terms of `(group 1)|(group 2)...`. But I see your point! – Mihai Dec 17 '21 at 11:01

1 Answers1

1

The main issue is that x is already consumed with \h*x\h*(]) part in the first alternative, and \h*(x) in the second alternative cannot re-match what has already been consumed.

If you put the second alternation in the branch reset group inside a lookahead you can "free up" the x for the second alternative to catch it:

(?|
  (\[) (?= \h* x \h* ] ) | (?<= \[ )(?= \h* x \h* (])) # <--- here
)
|
(?:
  (?<= \[ ) \h* (x) (?= \h* ] )
)

See the regex demo. Pay attenation to the (?=\h*x\h*(])) part: it is now a positive lookahead that only checks for its pattern match immediately on the right, but does not put the matched text to the match value buffer and does not advance the regex index, so that subsequent subpatterns can try to match their patterns against this text.

To accommodate for further alternatives, make sure you use this technique: try to match as close to the start of string as possible, and only consume text that does not have to be rematched, else, use positive lookaheads with capturing groups in them.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563