3

Edit: This had been closed as too broad without justification and I believe it's a mistake. One question was also proposed as a duplicate : Regex: Determine if two regular expressions could match for the same input? It is similar but mine is an R question which asks not if two expressions match the same, but if one's matches contains the matches of the other, and my ideal output is a practical function, or cues that I can use to build one, the theory linked is a bit overwhelming for a R programmer. As a side note this linked question has 40+ votes and was never closed despite being broader than mine.


I want to assess if a regex pattern will match in any circumstances what another pattern matches.

For instance .* matches everything, so, naming fun the function I would like to have, we'd have fun(".*", "foo") to be TRUE because if "foo" is matched, ".*" will be matched as well.

Other calls returning TRUE would be:

fun("[ab]","a")
fun("\\D","^\\D{2}$")

If I revert the arguments of the examples above, or if I have two patterns that might overlap, it should return FALSE.

I want to be able to do this from the patterns alone, not by testing them on data first. How can I do that ?

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • why not test the input string with both of the pattern and return value accordingly – Code Maniac Jul 13 '19 at 13:43
  • because I'm thinking of applications with hundreds of thousands of input strings – moodymudskipper Jul 13 '19 at 14:02
  • 2
    You might find [this](https://stackoverflow.com/q/3410256/4934172) or [this](https://stackoverflow.com/q/8961632/4934172) useful. – 41686d6564 stands w. Palestine Jul 13 '19 at 14:40
  • Thanks Ahmed, it seems to show that it's possible (so not too broad, hint hint) – moodymudskipper Jul 13 '19 at 14:50
  • @Moody_Mudskipper The question is probably too broad because it doesn't show your attempts to solve the problem, _not_ because whether or not what you're trying to do is possible. – 41686d6564 stands w. Palestine Jul 13 '19 at 15:19
  • I have absolutely no idea how to solve this problem, so the best thing I can do is describe it as well as I can, give an input and the expected output. I'll welcome all suggestions to improve these aspects. – moodymudskipper Jul 13 '19 at 15:34
  • 1
    Possible duplicate of [Regex: Determine if two regular expressions could match for the same input?](https://stackoverflow.com/questions/3410256/regex-determine-if-two-regular-expressions-could-match-for-the-same-input) – Jimi Jul 13 '19 at 20:44
  • 1
    Voted to reopen (I don't think this is too broad), but I think should probably be reclosed as a clear duplicate of the question @Jimi linked. – Linuxios Jul 14 '19 at 16:44
  • 1
    @Moody_Mudskipper: I think it was likely an honest mistake. Stuff can often get over-zealously closed through the review queues; it's the price we pay for having systems to more effectively moderate the site. I came to this question through the reopen queue, so hopefully enough others will come by and vote to reopen. – Linuxios Jul 14 '19 at 16:50
  • 2
    @Linuxios No problem to vote for reopening. It could be marked as duplicate after, since it's asking for the results of an intersection anyway. But it could be decided after (it could also be more useful this way, mixing a *You can't*, with a *You could, if you want to blow up your brain - and probably your code, possibly the universe - delving into this matter a little deeper*). – Jimi Jul 14 '19 at 18:33
  • Maybe @Moody_Mudskipper should give an example of a regexp usage in r, to make it less broad and verify we are talking about stringr. 1. [The duplicate example](https://stackoverflow.com/questions/3410256/regex-determine-if-two-regular-expressions-could-match-for-the-same-input) is actually broader than this, as they did not mention which regexp they are talking about 2. [This theoretical question](https://stackoverflow.com/questions/8961632/how-should-one-proceed-to-prove-or-find-if-two-regular-expressions-are-same-or) is not a duplicate as it deals with theoretical regular expressions only – Shahar Bental Jul 14 '19 at 21:07
  • @Moody_Mudskipper are there, in practice, any restrictions on the kinds of regular expressions you need to answer this question for? If it turns out that you’re only interested in a simpler subset, a much less theoretically demanding solution could be possible. Otherwise, you’re going to be getting into so,e deeper theoretical work no matter what you do. – Linuxios Jul 14 '19 at 22:34
  • I think there might be, I'll sleep on it and will update my question with some details about my use case when I get the chance, to answer both Shahar's and your comment. – moodymudskipper Jul 14 '19 at 23:42

1 Answers1

2

Assuming you are using the standard perl-compatible regex, which is not really regular, but actually much-much stronger, the answer is You can't.

Regex is strong enough to do really cool things like Solving 3-Sat with regex.

Any regex that allows groups and lookahead is somewhat of a CFG (sometimes it seems to be weaker, but sometimes stronger), so I feel like you are trying to directly solve an undecidable problem

More directly, it might be easier to show that your solution solves the Post correspondence problem, which is undecidable.

Assume, for sake of contradiction, that you wrote your function successfully. We are now trying to solve the post correspondence problem, so we are getting 2 lists of words. Throughout what's next, I will try using the example from wikipedia, that is, the lists I will demonstrate on are ['a','ab','bba'] and ['baa','aa','bb'].

As a first argument, we will give your function something similar to ^$. As a second, we will construct a regex that accepts a word if it is a solution to the post correspondence problem.

We will construct the regex so that each group corresponds to a specific word, and then ask the regex to match a word to the repetition of both the first and second list.

Here's an example:

^(?:(a))(?:(ab))(?:(bba))XXX(?:(baa))(?:(aa))(?:(bb))YYY(?:(\1?\2?\3?)*$)(?:(\4?\5?\6?)*$)

Up until the first separator, XXX, we assign group \1 to a, and so on. Up until the second separator, YYY, we assign group \4 to baa and so on.

After the second separator, we demand to find either a repetition of the first set of words, which is also a repetition of the second set of words.

Here's the example in rubular

(?:(a))(?:(ab))(?:(bba))XXX(?:(baa))(?:(aa))(?:(bb))

Would make \1 match a and \4 match baa, if our "test text" starts with aabbbaXXXbaaaabb

To wrap it up, instead of the empty string, we check if this is contained in the regex that is just the prefix, i.e. whether fun("^aabbbaXXXbaaaabbYYY$",<our huge regex>) is True

This solves the post correspondence problem, which is undecidable.

We used the more "advanced" regex features to make it work. If we were to not use them, we would get 2 regular languages. Comparing those is both a theoretical and practical task of constructing an automata that travels on both, and then check if the accepting states of the second machine are all included in the accepting states of the second machine.

Shahar Bental
  • 991
  • 6
  • 16
  • Thanks for the very detailed answer, and the great links too, I appreciate, it might take me a lifetime to completely understand it though :). My reasoning was that if I, a human, can solve the examples used in my question, an algorithm could do it too, but I understand you imply that neither human nor machine can solve this issue in a general manner. Would it be more reasonable to ask : which algorithm would solve all the solvable cases of this problem ? – moodymudskipper Jul 14 '19 at 11:49
  • It can all be find in standard textbooks like Introduction to the Theory of Computation. Specifically, you should be able to solve "regular" expressions that only contains concatenation, kleene star and plus, and or. My intuition is that dynamic programming can solve it at reasonable time. For example, consider exp1 and exp2, and assume substring(exp1,1,2)==".*" and substring(exp2,1,1) is some letter (or wild card) then I think fun(exp1,exp2)=fun(exp1,substring(exp2,2,nchar(exp2)) | fun(substring(exp1,3,nchar(exp1)),exp2)|fun(substring(exp1,3,nchar(exp1)),substring(exp2,2,nchar(exp2))) – Shahar Bental Jul 14 '19 at 21:15