1

Let's say I have one regular expression A and another regular expression B as input. I want to create a new regular expression C which matches a line if and only if

  • A matches the line and
  • B does not match the line.

I am able to manually create C for very simple cases of A and B: Let's say A is x and B is y, then C = ^[^y]*x[^y]*$ would be a valid solution.

Obviously, the problem gets harder as A and B get more complex. Is there a generic algorithm for creating such a regular expression C out of A and B?


Note: Since regular languages are closed under intersection and complement, such an algorithm should theoretically exist. I am aware that the expressive power of regular expressions available in modern IT systems exceeds that of formal regular languages, but a solution where A and B are restricted to the subset of features available in formal languages, but C uses extended features of modern-day regex engines, is perfectly fine for me.

Heinzi
  • 167,459
  • 57
  • 363
  • 519
  • You could try `^(?!B)A$`, but it really might depend on the actual regular expressions: `^(?!y)x$` – ctwheels Jul 18 '19 at 16:15
  • What happens in your example if `y` is a substring of `x` ? Your approach only works when there are disjoint matches I think. While the conversion should be possible in theory, I think that testing a string with two distinct patterns combining the boolean results ( `/A/.test(string) and not (/B/.test(string))` in pseudo-syntax ) will be easier. – collapsar Jul 18 '19 at 16:33

2 Answers2

2

Edit

Based on the OP's initial regex and as pointed out by @ruakh in the comments below my answer, the OP has chosen to use ^(?!.*B).*A. This solution matches any strings that contain B, rather than what my original post (below) targeted, which is any string that matches B as was originally assumed and later clarified (in the comments below my answer) by the OP.


Original Post

If I understand your question correctly, you're looking to match a string that matches one given pattern A, but not match pattern B, such that your new pattern C is comprised of both A and B.

Simple regex

Given that the pattern A is x and the pattern B is y, the new regex pattern C should be as follows:

^(?!B$)A$

or with the sample regex you presented:

^(?!y$)x$

Maybe a better example to demonstrate this is with the following:

  • A pattern: x.
  • B pattern: xx
  • C becomes: ^(?!xx$)x.$

This would match xa but not xx as seen here


Complex regex

With regards to more complex regular expressions, it might depend on the patterns entirely and the regex engine that is used. The regular expression could time out and if recursion, control verbs, pattern modifiers, etc. are used, it could break entirely.

A better option would be to evaluate both regular expressions independently with code to determine the outcome.

Example 1

Here's an example where the regular expression breaks given that both patterns use the same predefined pattern name:

  • A: (?(DEFINE)(?<t>x))(?&t).
  • B: (?(DEFINE)(?<t>x))(?&t){2}
  • C: ^(?!(?(DEFINE)(?<t>x))(?&t){2}$)(?(DEFINE)(?<t>x))(?&t).$

It fails as shown here

Example 2

Here's a recursion example that fails to work properly:

  • A: a(?R)z
  • B: az
  • ^(?!az$)a(?R)?z$

It fails as shown here


Of course, this assumes that the initial assumption that C: ^(?!B$)A$ is the correct pattern to use for the matching of A and non-matching of B.

ctwheels
  • 21,901
  • 9
  • 42
  • 77
  • `^(?!B)A$` is a practical solution, but the negative lookahead pattern used is a match specification for a regex engine which lietarlls speaking is beyond the regex language of theoretical CS. Moreover, the lookahead does not specify that `B` must match the whole line but only requires that it matches no prefix (so it should possibly be `^(?!B$)A$`). It is however unclear whether that is hat the OP wants (his example only makes sense when partial matches are allowed, while the verbal description suggests complete matches. – collapsar Jul 18 '19 at 16:43
  • @collapsar yes and the negative lookahead fails when you combine it with more complex use-cases like pre-define patterns. It only works for some patterns, but others will fail (which I tried to demonstrate in practice) – ctwheels Jul 18 '19 at 16:46
  • The pattern names should not be a problem since these are bound variables to their respective regex and need a simple relabelling - or am I missing something here ? Moreover, I am not sure about the expressive power of predefined pattern names, but I suspect that this allows for context-free grammars. – collapsar Jul 18 '19 at 16:49
  • [This SO answer](https://stackoverflow.com/a/11382641/) confirms that allowing for lookarounds of unlimited length and recursion goes beyond the notion of regular languages from TheoCS. Recursion in particular is a killer as it permits recognizing CF languages which are not close under intersection nor complement. – collapsar Jul 18 '19 at 16:58
  • Wrt anchor use: say `A=xx` and `B=x`, the negative lookahead without `$` in the combined regex would reject the line `xx` whereas is should be accepted, if OP requires that `B` do not match the complete line. However, adding `$` to the NLA resolves the problem. – collapsar Jul 18 '19 at 17:02
  • @collapsar I edited my answer with `$`; realized the same – ctwheels Jul 18 '19 at 17:03
  • 1
    Your first example should be `^(?!.*B).*A` rather than `^(?!B$)A$`, and similarly for various of your other examples. (The question says: "Let's say **A** is `x` and **B** is `y`, then **C** = `^[^y]*x[^y]*$` would be a valid solution.") – ruakh Jul 18 '19 at 18:50
  • @ruakh only if the OP specifies "contains" or "ends with". The OP didn't make that specification. – ctwheels Jul 18 '19 at 19:13
  • @ctwheels: Not so; did you read the OP's example that I quoted? I'll grant that the question is poorly written -- the OP makes lots of assumptions about what "regular expression" and "match" means, apparently not realizing that they mean different things in different contexts -- but the example makes it clear that if **B** is `y`, then **B** "matches" any line *containing* `y`. – ruakh Jul 18 '19 at 19:37
  • @ctwheels: Thanks, `^(?!.*B).*A$` works perfectly for my use cases. ruak is right: Since most regex engines interpret `x` (without `^` and `$`) as "contains `x`", that's the interpretation that I have used as well. – Heinzi Jul 19 '19 at 11:16
0

I'm guessing that the answer is most likely no because A, B, and C can be dependent and independent expressions, then the outcomes would fall into combination category, which also includes the permutation instances and there would be an infinite number of such expressions. Then, I highly doubt that there would be one generic algorithm for.

Emma
  • 27,428
  • 11
  • 44
  • 69