Intersection of two regular expressions

Question

Im looking for function (PHP will be the best), which returns true whether exists string matches both regexpA and regexpB.

Example 1:

$regexpA = '[0-9]+';
$regexpB = '[0-9]{2,3}';

hasRegularsIntersection($regexpA,$regexpB) returns TRUE because '12' matches both regexps

Example 2:

$regexpA = '[0-9]+';
$regexpB = '[a-z]+';

hasRegularsIntersection($regexpA,$regexpB) returns FALSE because numbers never matches literals.

Thanks for any suggestions how to solve this.

Henry

Hm ... most regex (or true regex) are equivalent to finite automata. I wonder if, given two such automata over the same alphabet, one can tell whether there is some string in the language that is accepted by both. Personally, I do not see a general way without brute-forcing all strings of length 1, 2, 3, 4, 5 ... in general this can take "forever" to check. — Hamish Grubijan, Jun 03 '10 at 14:35
@rfusca: Brute forcing over an infinite list of possible inputs is trivially undecidable. If there is no intersection, a brute force algorithm would never terminate. — sepp2k, Jun 03 '10 at 14:50
@Hamish: It is perfectly possible to find the intersection of two FAs (and thus of two regular languages). The only problem is that most regex engines found in today's programming languages actually have features that allow you to match non-regular languages (backreferences are a common one). I'm pretty sure php's is one of them. So if Henry were to create a solution using finite automata, it would not work for all regexen, only for those that only use regular features. — sepp2k, Jun 03 '10 at 14:58
@sepp2k, I am dying to know how to find the intersection of two FAs and check whether the resulting Frankenstein accepts anything. Languages being accepted can be infinite, at least the FAs can be represented with a finite number of bits. — Hamish Grubijan, Jun 03 '10 at 15:13

sepp2k · Answer 1 · 2010-06-03T15:29:54.967

For regular expressions that are actually regular (i.e. don't use irregular features like back references) you can do the following:

Transform the regexen into finite automata (the algorithm for that can be found here(chapter 9) for example).
Build the intersection of the automata (You have a state for each state in the cartesian product of the states of the two automata. You then transition between the states according to the original automata's transition rules. E.g. if you're in state x1y2, you get the input a, the first automaton has a transition x1->x4 for input x and the second automaton has y2->y3, you transition into the state x4y3).
Check whether there's a path from the start state to the end state in the new automaton. If there is, the two regexen intersect, otherwise they don't.

Or, use the regex-to-DFA algorithm in the Dragon Book to create the DFA and you're done in one step: if any accepting state is made up of positions from both expressions, then the two expressions intersect. — Ron Burk, Aug 18 '21 at 06:26

score 3 · Answer 2 · answered Jul 30 '13 at 20:53

Theory.

Java library.

Usage:

/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
   Automaton automaton1 = new RegExp(regex1).toAutomaton();
   Automaton automaton2 = new RegExp(regex2).toAutomaton();
   return automaton1.intersection(automaton2).isEmpty();
}

score 2 · Answer 3 · answered Jun 03 '10 at 15:32

A regular expression specifies a finite state machine that can recognize a potentially infinite set of strings. The set of strings may be infinite but the number of states must be finite, so you can examine the states one by one.

Taking your second example: In the first expression, to get from state 0 to state 1, the string must start with a digit. In the second expression, to get from state 0 to state 1, the string must start with a letter. So you know that there is no string that will get you from state 0 to state 1 in BOTH expressions. You have the answer.

Taking the first example: You know that if the string starts with a digit you can get from state 0 to state 1 with either regular expression. So now you can eliminate state 0 for each, and just answer the question for each of the two (now one state smaller) finite-state-machines.

This looks a lot like the well-known "halting problem", which as you know is unsolvable in the general case for a Turing machine or equivalent. But in fact the halting problem IS solvable for a finite-state machine, simply because there are a finite number of states.

I believe you could solve this with a non-deterministic FSM. If your regex had only one transition from each state to the next, a deterministic FSM could solve it. But a regular expression means that for instance if you are in state 2, then if the caracter is a digit you go to state 3, else if the character is a letter you go to state 4.

So here's what I would do:

Solve it for the subset of FSM's that have only one transition from one state to the next. For instance a regex that matches both "Bob" and "bob", and a second regex that matches only "bob" and "boB".
See if you can implement the solution in a finite state machine. I think this should be possible. The input to a state is a pair representing a transition for one FSM and a transition for the second one. For instance: State 0: If (r1, r2) is (("B" or "b"), "b") then State 1. State 1: If (r1, r2) is (("o"), ("o")) then state 2. etc.
Now for the more general case, where for instance state two goes back to state two or an earlier state; for example, regex 1 recognizes only "meet" but regex 2 recognizes "meeeet" with an unlimited number of e's. You would have to reduce them to regex 1 recognizing "t" and regex 2 recognizing "t". I think this may be solvable by a non-deterministic FSM.

That's my intuition anyway.

I don't think it's NP-complete, just because my intuition tells me you should be able to shorten each regex by one state with each step.

As sepp2k pointed out, this only applies to regexes that really DO recognize only regular expressions. — Mark Lutton, Jun 03 '10 at 15:54

mqchen · Answer 4 · 2010-06-03T15:14:15.053

0

It is possible. I encountered it once with Pellet OWL reasoner while learning semantic web technologies.

Here is an example that shows how regular expressions can be parsed into a tree structure. You could then (in theory) parse your two regular expressions to trees and see if one tree is a subset of the other tree, ie. if one tree can be found in within other tree's nodes.

If it is found, then the other regular expression will match (not only, but also) a subset of what the first regular expression will match.

It is not much of a solution, but maybe it'll help you.

edited Jun 03 '10 at 15:14

answered Jun 03 '10 at 15:08

mqchen

4,195
1
22
21

This is fun, but ... once you have this tree, what do you do next? – Hamish Grubijan Jun 03 '10 at 15:15
Looking at the syntax tree of two regular expressions will not tell you whether they're equivalent. Two regexen can have completely different syntax trees and still be equivalent (take `/(1+|0+)+/` and `/0(0|1)*|1(1|0)*/` for example). – sepp2k Jun 03 '10 at 15:16
Hmm, I'm a little confused. How can both `/(1+|0+)+/` and `/0(0|1)*|1(1|0)*/` match the same string? The first regex requires the string to begin with `1` while the other requires it begin with `0`. (I'm a bit tired right now, so I might be completely off...) – mqchen Jun 03 '10 at 15:38
@mq.chen: Both regexen are equivalent to `/(1|0)+/` (or `/(0|1)+/`, which is the same thing). Notice the `|` - both regexen can start with either 1 or 0. – sepp2k Jun 03 '10 at 16:54

Intersection of two regular expressions

4 Answers4

Linked