Regex: Determine if two regular expressions could match for the same input?

Question

I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.

For example, we know that the regular expressions below are quite different but they both match xy50:

'^xy1\d'
'[^\d]\d2$'

Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?

I have a sneaking suspicion that this problem is equivalent to the Halting Problem. — Seamus Campbell, Aug 04 '10 at 22:15
@Seamus Campbell: Excellent point! Is this the smell of the halting problem? If not, then how can it be implemented? — Tom, Aug 04 '10 at 22:17
Alternative way of thinking about this is.. Why don't you add to the user's regex thus making the regular expressions mutually exclusive? I.e. Tack on the end that it's not the same as the previous one Does that help? — MikeG, Aug 04 '10 at 22:20
I strongly doubt this is equivalent to the halting problem. The halting problem applies to Turing Machines, which are on the top of the Chomsky hierarchy (http://en.wikipedia.org/wiki/Chomsky_hierarchy). Regular expressions are on the bottom of the hierarchy. I suspect this is a solvable problem, although I don't specifically know that it is or how to do it. — John Kugelman, Aug 04 '10 at 22:21
Tom, maybe you should say why you want a list of user entered regexes, might help with some good alternative suggestions coming your way! — MikeG, Aug 04 '10 at 22:27
Uh, neither of the two regexes you provided matches `xy50`... — Tim Pietzcker, Apr 10 '12 at 14:17

score 40 · Accepted Answer · edited Aug 02 '18 at 20:30

There's no halting problem involved here. All you need is to compute if the intersection of ^xy1\d and [^\d]\d2$ in non-empty.

I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting the construction of a DFA:

http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html

And then there's RAGEL

http://www.complang.org/ragel/

which can compute the intersection of regular expressions too.

UPDATE: I just tried out Ragel with OP's regexp. Ragel can generate a "dot" file for graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexp looks like this in Ragel syntax:

('xy1' digit any*) & (any* ^digit digit '2')

and has the following state machine:

While the empty intersection:

('xy1' digit any*) & ('q' any* ^digit digit '2')

looks like this:

So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.

A bit late to add this in, but a simple proof works like this: After you add intersect complement of the second with the first, you can test to see if it is non-empty by trying all combinations of letters in the alphabet with at most as many characters as there are states in your FSM. The rest is just the pumping lemma since the number of states in a FSM is an easy upper bound on the pumping limit. — Whoopska, Jul 30 '11 at 19:56

score 23 · Answer 2 · answered Aug 04 '10 at 22:29

The problem can be restated as, "do the languages described by two or more regular expressions have a non-empty intersection"?

If you confine the question to pure regular expressions (no backreferences, lookahead, lookbehind, or other features that allow recognition of context-free or more complex languages), the question is at least decidable. Regular languages are closed under intersection, and there is an algorithm that takes the two regular expressions as inputs and produces, in finite time, a DFA that recognizes the intersection.

Each regular expression can be converted to a nondeterministic finite automaton, and then to a deterministic finite automaton. A pair of DFAs can be converted to a DFA that recognizes the intersection. If there is a path from the start state to any accepting state of that final DFA, the intersection is non-empty (a "conflict", using your terminology).

Unfortunately, there is a possibly-exponential blowup when converting the initial NFA to a DFA, so the problem quickly becomes infeasible in practice as the size of the input expressions grows.

And if extensions to pure regular expressions are permitted, all bets are off -- such languages are no longer closed under intersection, so this construction won't work.

michid · Answer 3 · 2010-08-04T22:44:59.650

Yes I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings. That is, all the strings they would match.

Let [R] be the set of strings generated by the regular expression R. Then given to regular expressions R and T, we could try to match T against [R]. That is [R] matches T iff there is a string in [R] which matches T.

It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state and then advances to 'all next states' simultaneously.

I don't quite get it. But then again the list of strings they would match is potentially infinite... Also, trying to match T against [R] is the key point here. Your algorithm'd need to be better defined imho. — jjmontes, May 31 '16 at 00:03

score 1 · Answer 4 · answered Jan 27 '18 at 07:28

Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.

  use Regexp::Optimizer;
  my $o  = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
  # $re is now qr/foo(?:[bx]ar|zap)/

.. first, optimize and then discard all redundant patterns.

Maybe Ron Savage's Regexp::Assemble could be even more helpful. It allows assembling an arbitrary number of regular expressions into a single regular expression that matches all that the individual REs match.* Or a combination of both.

_{* However, you need to be aware of some differences between Perl and Java or other PCRE-flavors.}

score 1 · Answer 5 · answered Sep 12 '18 at 14:30

If you are looking for a lib in Java you can use Automaton using '&' operator:

RegExp re = new RegExp("(ABC_123.*56.txt)&(ABC_12.*456.*\\.txt)", RegExp.INTERSECTION); // Parse RegExp
    Automaton a = re.toAutomaton(); // convert RegExp to automaton

    if(a.isEmpty()) { // Test if intersection is empty
      System.out.println("Intersection is empty!");
    }
    else {
      // Print the shortest accepted string
      System.out.println("Intersection is non-empty, example: " + a.getShortestExample(true));
    }

Original Answer:

Detecting if two regexes could possibly match the same string

Regex: Determine if two regular expressions could match for the same input?

5 Answers5

Linked

Related