Calculate if two infinite regex solution sets don't intersect

Question

In calculate if two arbitrary regular expressions have any overlapping solutions (assuming it's possible).

For example these two regular expressions can be shown to have no intersections by brute force because the two solution sets are calculable because it's finite.

^1(11){0,1000}$ ∩     ^(11){0,1000}$        = {}
{1,111, ..., ..111} ∩ {11,1111, ..., ...11} = {}
{}                                          = {}

But replacing the {0,1000} by * remove the possibility for a brute force solution, so a smarter algorithm must be created.

^1(11)*$ ∩ ^(11)*$ = {}
{1,^1(11)*$} ∩ {^(11)*$} = {}
{1,^1(11)*$} ∩ {11,^11(11)*$} = {}
{1,111,^111(11)*$} ∩ {11,^(11)*$} = {}
.....

In another similar question one answer was to calculate the intersection regex. Is that possible to do? If so how would one write an algorithm to do such a thing?

~~I think this problem might be domain of the halting problem.~~

EDIT:

I've used the accepted solution to create the DFAs for the example problem. It's fairly easy to see how you can use a BFS or DFS on the graph of states for M_3 to determine if a final state from M_3 is reachable.

DFA solution

the linked question has some very good answers. Why is this not a duplicate in your opinion? IOW, please motivate the legitimacy of this 'reformulation'? — sehe, Oct 11 '11 at 21:48
The linked question generated many theoretical answers. I like this question as posed because it explicitly invites a practical solution. It reminds me of a question which I posed earlier this year, motivated simply by the need to efficiently evaluate ~100 regexes against ~10^9 strings. It is useful to "precalculate" regex "relationships" (orthogonality, >, <, ~) in such a scenario only if if the logic which is used to calculate relationships is efficient — Peter, Oct 12 '11 at 01:58

Patrick87 · Accepted Answer · 2011-10-12T01:58:42.913

18

It is not in the domain of the halting problem; deciding whether the intersection of regular languages is empty or not can be solved as follows:

Construct a DFA M1 for the first language.
Construct a DFA M2 for the second language. Hint: Kleene's Theorem and Power Set machine construction
Construct a DFA M3 for M1 intersect M2. Hint: Cartesian Product Machine construction
Determine whether L(M3) is empty. Hint: If M3 has n states, and M3 doesn't accept any strings of length no greater than n, then L(M3) is empty... why?

Each of those things can be algorithmically done and/or checked. Also, naturally, once you have a DFA recognizing the intersection of your languages, you can construct a regex to match the language. And if you start out with a regex, you can make a DFA. This is definitely computable.

EDIT:

So to build a Cartesian Product Machine, you need two DFAs. Let M1 = (E, q0, Q1, A1, f1) and M2 = (E, q0', Q2, A2, f2). In both cases, E is the input alphabet, q0 is the start state, Q is the set of all states, A is the set of accepting states, and f is the transition function. Construct M3 where...

E3 = E
Q3 = Q1 x Q2 (ordered pairs)
q0'' = (q0, q0')
A3 = {(x, y) | x in A1 and y in A2}
f3(s, (x, y)) = (f1(s, x), f2(s, y))

Provided I didn't make any mistakes, L(M3) = L(M1) intersect L(M2). Neat, huh?

edited Oct 12 '11 at 01:58

answered Oct 11 '11 at 21:45

Patrick87

27,682
3
38
73

5

Note: this is assuming _true_ regular expressions (which, in real life, almost no regex engines are limited to due to lookaheads, lookbehinds, back references and what else has not been added on for pragmatic reasons) – sehe Oct 11 '11 at 21:50
3

@sehe: Just a note: lookahead/lookbehind don't add power. They are still in the domain of *true* regular expressions :) – porges Oct 11 '11 at 22:06
Do you think you could a bit more details about `Cartesian Product Machine construction`? Or even show how it could be done using the example I gave? All I could find about it was a [partial book](http://bit.ly/qmxRmZ) on google books. – Kendall Hopkins Oct 11 '11 at 22:07
@sehe: Good point. The problem of determining whether the intersection of two CF languages is empty cannot be solved algorithmically. (It can for a CF and a regular language, though). – Patrick87 Oct 11 '11 at 23:52
You have `q0'' = (q0, q0'')` do you mean `q0'' = (q0, q0')`? Or else the definition would be recursive. – Kendall Hopkins Oct 12 '11 at 01:02
@Patrick87 Also can you recommend any more reading on the subject of `Cartesian Product Machine` and their construction? – Kendall Hopkins Oct 12 '11 at 01:05
@Kendall Hopkins: I think Martin's Introduction to the Theory of Computation discusses this. Might not be the right title... will check tomorrow. You were right about q0'' = (q0, q0'). – Patrick87 Oct 12 '11 at 02:00
Re your "exercise for the reader": Suppose M3 accepts no string of length <= n, but does accept some string s of length m > n. Because there are only n states, this means that at least one state must be visited more than once while processing s. Of all these repeatedly-visited states, let x be the first one visited. Suppose the first visit to x occurs after reading i symbols of s, and the last visit occurs after reading j > i symbols. Then deleting the substring [i+1, j] from s produces a new string that is at least 1 symbol shorter and is still accepted by M3... – j_random_hacker Oct 12 '11 at 16:42
2

... This can be repeated as long as the resulting string is longer than n, so eventually a string with length <= n will be produced, which contradicts the original assumption. – j_random_hacker Oct 12 '11 at 16:45
@j_random_hacker: Yep, you got it. Good job! – Patrick87 Oct 12 '11 at 17:05
I'm attempting to use your defined algorithm to create the DFAs for my example? Could you check if [this is correct](http://imgur.com/KLSmV)? And then I can confirm that `M_3` is empty by walking every possible solution of len 6? Or do I only need to walk length 3 since there are only 3 reachable nodes? – Kendall Hopkins Oct 12 '11 at 23:53
Wouldn't another way to approach proving M_3 empty is to walk the states, but never repeatedly visit a state. Either you run out of paths to walk (empty) or you find a state that is in the accepted state set (not empty). – Kendall Hopkins Oct 13 '11 at 00:07
@Kendall: I think it's even simpler than you suggest in your last comment -- all you need to do is "forget" the symbol labels on the transition arcs (leaving just a directed graph of states) and test whether the initial state is connected to at least one accepting state (i.e. whether some path exists between them). So plain old depth-first search or breadth-first search should work. :) – j_random_hacker Oct 13 '11 at 00:38
1

... which I see now is essentially what you're already saying :) As you say, the DFS/BFS needs to avoid previously-visited states to avoid infinite loops. – j_random_hacker Oct 13 '11 at 00:41
@j_random_hacker Awesome! I believe I can implement this at the [regex level](https://github.com/KendallHopkins/RegexCompiler) without even converting to a DFA (since they are so similar). Thanks for the explanation for the bonus. It helped alot in my understanding of this solution. – Kendall Hopkins Oct 13 '11 at 00:46

Kendall Hopkins · Answer 2 · 2012-08-13T23:05:44.310

I've created a PHP implementation of Patrick87 answer. In addition to implementing the Intersection via Cartesian Product Machine, I've also implemented an alterative algorithm for finding Intersections of DFAs using De Morgan.

Intersection( DFA_1, DFA_2 ) === ! UNION( ! DFA_1, ! DFA_2 )

* ! is defined as negation

This works very well for DFAs as the negation of a fully defined DFA (those with every possible transition state defined) is just to add all non-final states to the final state set and remove all current final states from the final state set (non-final -> final, final -> non->final). Union of DFA can be done easily by turning them into a NFA and then creating a new starting node that connects the unioned DFA's old start nodes by lambda transforms.

In addition to solving the intersection problem, the library I created is also able to determinize a NFA to a DFA and convert Regex to NFA.

EDIT:

I have created a webapp that allows this sort of transformations on regex languagues using what I learned form this question (and others).

Calculate if two infinite regex solution sets don't intersect

2 Answers2

Linked