2

Is it possible to specify a regex that matches x...(a and b)...y within a limited length of n?

To be more precise:

  1. The length of chars between the matched x and y must be at most n.
  2. Both an a and a b (regardless of order) must exist between the matched x and y.
  3. x, a, b, and y here could stand for a multi-char string snippet.

Test cases (assume n = 10):

Match:
...x..a...b..y...
...x..b...a..y...
...x..a...b..y...a...
...x..a...b..y...b...
...x..a...b..y...y...
...x..a.x.b..y...
...x..a.y.b..y...
...xaabbaabbay...
...x..a...b..y... ...x..a...b..y... (2 matches)
...xaby...xaby... (2 matches)

Don't match:
...x..a...b......
...x......b..y...
...x..a......y...
...x..a...b......y...
...x......b..y...a...
...x..a......y...b...
...x..a.y.b......
...x..a.y.b......y...
...x..a.y.b.x....y...
.a.x......b..y...
.b.x..a......y...

P.S: I know that this can be done by simply match /x.{0,n}y/ and then check whether a and b both exist in the matched string in many programming languages. However, this question explicitly requests for a single regex approach, so that it can be used as a query in some applications, such as Google Doc and Notepad++.

Danny Lin
  • 2,050
  • 1
  • 20
  • 34
  • Yes, it is possible (all patterns with limited length are), but will be cumbersome. Can you use lookahead? – Bergi Sep 05 '21 at 01:29
  • I can. Can you just craft one working regex for the question? – Danny Lin Sep 05 '21 at 03:13
  • The tag excerpt says "all questions with this tag should also include a tag specifying the applicable programming language or tool". Can you do that, please? I recall non-fixed width lookarounds isn't supported in all regex engines/flavours. DId you mean to use `a` in `/a.{0,n}y/`. Your explanation seems to indicate you mean `x`. – Scratte Sep 05 '21 at 07:08
  • Is there any risk that there's another y after the first one? Note that any solution to this is likely to only match single characters, where you'll not be able to replace "a", "b" nor "y" with a series of characters in a proposed regex solution. – Scratte Sep 05 '21 at 07:16
  • @Scratte This question is about a general single regex solution that can do the work, which is mostly as a query of application. It shouldn't be dependent of particular programming language. If it's insisted that some language/tool be specified, we'll talk about Notepad++ as the example. – Danny Lin Sep 05 '21 at 07:39
  • @Scratte Yes, it's possible that there's another y after the matched one. A single regex solution is more likely be used as a query of an application to quickly jump between matches. Further replacing or so is likely not to be cared about—if it were, a technique like the "match x...y then check" mentioned in the OP would probably be applied. – Danny Lin Sep 05 '21 at 07:51
  • 1
    Kindly add that the regex should match `x...a.y.b...y` to your Question. And that it should not f.ex. match `x...a.y.b..1234567890..y`, which with the first requirement becomes an added trickyness. – Scratte Sep 05 '21 at 09:14
  • So for example `...x12a456b89y... ...x12a456b89y... ` should give 2 matches? – The fourth bird Sep 05 '21 at 11:32
  • 1
    @Thefourthbird Yes – Danny Lin Sep 05 '21 at 11:34
  • Perhaps like this? `x(?![^x\n]{11,}y)(?:[^a\n]*a[^b\n]*b|[^b\n]*b[^a\n]*a)[^y\n]*y` https://regex101.com/r/pdotmC/1 – The fourth bird Sep 05 '21 at 12:08
  • @Thefourthbird This doesn't match `...x12a456b89y123y...` – Danny Lin Sep 05 '21 at 12:23
  • Is `...xaabbaabby...` supposed to match or not ? – Toto Sep 05 '21 at 14:24
  • @Toto Sure. It perfectly matches all criteria. – Danny Lin Sep 05 '21 at 14:44
  • 1
    @Thefourthbird That also matches `...x...a.yx.b...y...` but shouldn't – Bergi Sep 05 '21 at 14:47
  • @DannyLin You [say](https://stackoverflow.com/questions/69059958/regex-for-x-a-and-b-y-within-a-limited-length#comment122058125_69061579) `x...a.y.b...y` must match, but there are 11 chars between `x` and `y`. Are you sure? – Wiktor Stribiżew Sep 05 '21 at 21:23
  • @WiktorStribiżew I didn't say that. I don't know how you have got the idea as your link points to a non-existing comment. Anyway, I agree that `x...a.y.b...y` shouldn't be matched in the context of `n` == 10. – Danny Lin Sep 06 '21 at 01:30

2 Answers2

2

I believe you can use

x(?:(?(1)(?!)|(a))|(?(2)(?!)|(b))|.){0,10}?y(?(1)|(?!))(?(2)|(?!))

See the regex demo. Details:

  • x - left-hand delimiter
  • (?:(?(1)(?!)|(a))|(?(2)(?!)|(b))|.){0,10}? - zero to ten occurrences (but as few as possible) of
    • (?(1)(?!)|(a)) - if Group 1 is null (if Group 1 was not matched before) match a, else, fail to trigger backtracking right away
    • | - or
    • (?(2)(?!)|(b)) - if Group 2 is null (if Group 2 was not matched before) match b, else, fail to trigger backtracking right away
    • |. - or any one char (other than line break char by default)
  • y - right-hand delimiter
  • (?(1)|(?!)) - conditional construct: if Group 1 participated in the match (if Group 1 value is not null), proceed, else, fail the match
  • (?(2)|(?!)) - conditional construct: if Group 2 participated in the match (if Group 2 value is not null), proceed, else, fail the match.

The last two conditionals make sure there is a and b in the matched text.

NOTE: If a and b can be multicharacter strings, n must be reduced to the length that is n-(a.length-b.length+2). So, if a is ac and b is bmc, you need to replace 10 with 10-(2-3+2) => 7. This also means you should check your expected limit length restriction beforehand, it should allow the length of both a and b combined.

NOTE 2: Add (?s) at the beginning of the pattern to match line break chars with . as well. See How do I match any character across multiple lines in a regular expression? for more options.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thank you. Your proposed solution looks promising. Unfortunately this fails on the `...xaby...xaby...` case by not giving 2 matches. Is there a way to revise it to get it work for such case? – Danny Lin Sep 06 '21 at 01:47
  • 1
    Using the concept of conditional, I've got it work using `x(?:(?=(?(1)(?!)|(a))|(?(2)(?!)|(b))|).){0,10}?y(?(1)|(?!))(?(2)|(?!))`. This captures both groups using lookahead assertion as a smarter way, and eliminates the need of a pre-modification of length if `a` or `b` is multi-char. – Danny Lin Sep 06 '21 at 02:23
  • @DannyLin My attempt was `x(?:(?(1)(?(2)[^y]|[^b])|(a))|(?(2)(?(1)[^y]|[^a])|(b))|[^xy]){0,10}y(?(1)|(?!))(?(2)|(?!))` but yours is nicer and only needs the non-greedy repetition – Bergi Sep 06 '21 at 02:24
  • Slightly modified to `x(?:(?=(?(1)|(a))|)(?=(?(2)|(b))|).){0,10}?y(?(1)|(?!))(?(2)|(?!))` for better simplicity and supports the case that `a` and `b` may overlap each other (e.g. `a` = `az` and `b` = `azz`). – Danny Lin Sep 06 '21 at 02:43
  • @DannyLin You do not need the lookaheads. Just remove dots from the conditionals inside the consuming part of the pattern and make the limiting quantifier lazy. I edited the solution. – Wiktor Stribiżew Sep 06 '21 at 07:11
  • Your revised version does work, but takes much more steps as it can't fail early. Using my test cases, your version takes 35720 steps and mime takes 2847 steps. There are still other benefits of my version as mentioned above. – Danny Lin Sep 06 '21 at 08:57
  • @DannyLin I am afraid that is already a bit farther from the question you asked as you did not mention overlapping. Your regex above will return unwelcome matches if `a` or `b` actually can overlap with `y`, when `a` or `b` is longer than `y`. See [this regex demo](https://regex101.com/r/TCja1Z/1). In this case, you need to add more requirements. – Wiktor Stribiżew Sep 06 '21 at 09:37
  • @WiktorStribiżew You are right, overlapping of `a`, `b`, and `y` are special rare cases that are not intended to be covered in the original question. However, the elimination of calculation of my version is still a welcomed benefit. Also, I'd suggest revising your version to something like `x(?:(?(1)(?!)|(a))|(?(2)(?!)|(b))|.){0,10}?y(?(1)|(?!))(?(2)|(?!))` for better performance (this efficiently reduces the steps from 35720 to 2831 for my test cases). – Danny Lin Sep 06 '21 at 10:23
  • @DannyLin Right, failing when Group 1 or 2 is already captured will trigger backtracking earlier. – Wiktor Stribiżew Sep 06 '21 at 10:27
  • @WiktorStribiżew Thank you. An extra question: is there an alternative for a regex engine without conditional support, such JavaScript? – Danny Lin Sep 07 '21 at 11:07
  • @DannyLin Try a pattern like `x(?=(?:(?!x|y).){0,9}a)(?=(?:(?!x|y).){0,9}b).{0,10}?y`. See [regex demo](https://regex101.com/r/zm4PFF/4). – Wiktor Stribiżew Sep 07 '21 at 11:15
  • @WiktorStribiżew Obviously it doesn't work for cases like `...x..a.x.b..y...` or `...x..a.y.b..y...`, and can be used only in restricted cases. – Danny Lin Sep 07 '21 at 12:07
0

I fear this is not possible without generating all valid combinations of lengths for the parts between x and a|b, between a and b, and between a|b and y.

For n=5 (i.e. 0-3 characters other than the a and b between x and y:

x ( [^ab]{0} ( a ( [^b]{0} b .{0,3}
                 | [^b]{1} b .{0,2}
                 | [^b]{2} b .{0,1}
                 | [^b]{3} b .{0}   )
             | b ( [^b]{0} a .{0,3}
                 | [^b]{1} a .{0,2}
                 | [^b]{2} a .{0,1}
                 | [^b]{3} a .{0}   ) )
  | [^ab]{1} ( a ( [^b]{0} b .{0,2}
                 | [^b]{1} b .{0,1}
                 | [^b]{2} b .{0}   )
             | b ( [^b]{0} a .{0,2}
                 | [^b]{1} a .{0,1}
                 | [^b]{2} a .{0}   ) )
  | [^ab]{2} ( a ( [^b]{0} b .{0,1}
                 | [^b]{1} b .{0}   )
             | b ( [^b]{0} a .{0,1}
                 | [^b]{1} a .{0}   ) )
  | [^ab]{3} ( a ( [^b]{0} b .{0}   )
             | b ( [^b]{0} a .{0}   ) ) ) y

I used [^…] instead of . to avoid too much backtracking from the wrong branch, though it could probably be optimised a bit further.

Bergi
  • 630,263
  • 148
  • 957
  • 1,375
  • I agree that crafting a regex that enumerates all combinations of `x...a...b...y` within length `n` would work. But this is probably only doable programmatically and is practical in vary limited/rare cases. – Danny Lin Sep 05 '21 at 15:00
  • @DannyLin Yes. I don't think there's a practical solution using only advanced regex combinators - even with lookaround, conditionals or recursion/subroutines. Maybe I took the question "*Is it possible*" a bit too literally - it is, what you describe is a regular language still. – Bergi Sep 05 '21 at 15:28
  • Looks like I was wrong - it *can* be solved with conditionals. Haven't used those myself yet – Bergi Sep 05 '21 at 21:44