20

Let's say we have regular expressions:

  • Hello W.*rld
  • Hello World
  • .* World
  • .* W.*

I would like to minimize the number of regexes required to match arbitrary input.

To do that, I need to find if one regular expression matches any input matched by another expression. Is that possible?

Billy3

Charles
  • 11,269
  • 13
  • 67
  • 105
Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
  • @skaffman: I think the regular-language tag is appropriate given that a regex describes a regular language -- it's just a simple way of representing it "on paper". But the question w.r.t. computer science has more to do with regular languages than regular expressions. – Billy ONeal Sep 02 '10 at 18:55
  • 1
    eh, title does not match the description? – maxschlepzig Sep 02 '10 at 19:02
  • I'm not sure if qualifies as an "algorithm", but using ".*" matches arbitrary input with one regular expression; I doubt it can be minimized to fewer than 1. :-) – Jerry Coffin Sep 02 '10 at 19:12
  • @Jerry: Well, these are just examples :) In the real cases they're more complicated. @maxschlepzig: I have modified the description slightly. – Billy ONeal Sep 02 '10 at 19:23

4 Answers4

12

Any regular expression can be linked to a DFA - you can minimize the DFA and since the minimal form is unique, you can decide whether two expressions are equivalent. Dani Cricco pointed out the Hopcroft O(n log n) algorithm. There is another improved algorithm by Hopcroft and Craft which tests the equivalence of two DFAs in O(n).

For a good survey on the matter and an interesting approach to this, I reccomend the paper Testing the Equivalence of Regular Languages, from arXiv.

Later edit: if you are interested in inclusion rather than equivalence for regular expressions, I have come across a paper that might be of interest: Inclusion Problem for Regular Expressions - I have only skimmed through it but it seems to contain a polynomial time algorithm to the problem.

Lawrence
  • 148
  • 6
  • Hmmm.. interesting. One issue though is that `.*` and `Hello World` are decidedly not equivalent, though `.*` can match anything `Hello World` can match. – Billy ONeal Sep 02 '10 at 19:37
  • I'm not sure of the meaning of "match" for you - it seems as though you don't want to test equivalence but rather inclusion. Can you be more exact with your question? – Lawrence Sep 02 '10 at 19:41
  • My difficulty is that I don't know exactly how to describe what I'm looking for -- I'm sorry for the runaround here. I have modified the question slightly -- from Wikipedia's description on set theory inclusion does seem to what I need. – Billy ONeal Sep 02 '10 at 19:58
  • @Billy: I think the linked paper I just added is what you're looking for. – Lawrence Sep 02 '10 at 20:00
  • 1
    @Billy ONeal: If you're OK with an approximation (which you probably are), the last paper in this answer is probably good for you. – jpalecek Sep 02 '10 at 21:19
  • I am the author of "The Inclusion Problem for Regular Expressions" mentioned above. The algorithm is polynomial, but only handles a subclass of regular expressions, called 1-unambioguous. It also does not include commonly used extensions like counters or backreferences. The ".*" in the expressions above make then not 1-unambiguous. – Dag Sep 23 '16 at 08:10
3

Sure!. A regular expression can be represented as an FSM (Finite State Machine) and there are technically infinite number of FSM that can recognize the same string.

Isomorphism is the name that describes if two FSM are equivalent. There are a couple of algorigthm to minimize an FSM. For example the Hopcroft minimization algorithm can minimize two FSM in O(n log n), on an n state automaton.

Dani Cricco
  • 8,829
  • 8
  • 31
  • 35
  • @Dani: Same problem with maxschlepzig's answer. Isomorphism is in class NP. – Billy ONeal Sep 02 '10 at 19:28
  • 2
    @Billy ONeal: First, (graph) isomorphism is in NP (that's true), but is believed not to be NP-complete, although not in P. However, we are talking about DFA isomorphism, which is a _completely_ different thing. – jpalecek Sep 02 '10 at 19:50
  • @jpalecek: How is DFA isomorphism any different? Is a DFA nothing more than a digraph? – Billy ONeal Sep 02 '10 at 20:23
  • 1
    @Billy: DFA isomorphism is in class P – Dani Cricco Sep 02 '10 at 20:25
  • 1
    @Billy DFA isomorphism is simpler. You only have one starting point – Dani Cricco Sep 02 '10 at 20:28
  • 1
    @Billy ONeal: Basically, the biggest difference between a DFA and a graph is that a DFA is labeled, so you know what to match in the first place (initial/terminal states, edges with the same letter). Also, there are DFAs that are isomorphic (we are talking about isomorphism defined by equivalence of suffixes starting from a state, there are other possible definitions of isomorphism) but their graph representation is not. – jpalecek Sep 02 '10 at 21:00
  • Just to be fair: To do what you want (which is equivalent to testing regexp equivalence), you need to do a possibly exponential NFA-to-DFA conversion. This problem is maybe not even in NP. However, this doesn't make it impractical; exponential algorithms are used all around. You just have to assess the trade-offs and if this is not what you need, you have to look for other ways to solve your primary problem (which we know nothing about). – jpalecek Sep 02 '10 at 21:05
2

This problem is called "inclusion" or "subsumption" of regular expressions, because what you are asking for, is whether the set of words matched by one regexp includes (or subsumes) the set of words matched by the other regex. Equality is a different question which usually means whether two regexps matches exactly the same words, i.e. that they are functionally equivalent. For example "a*" includes "aa*", while they are not equal.

All known algorithms for regexp inclusion are the worst case take time exponential in the size of the regexp. But the standard algorithm is like this:

Input r1 and r2 Output Yes if r1 includes r2

  1. Create DFA(r1) and DFA(r2)
  2. Create Neg(DFA(r1)) (which matches exactly those words r1 dont match)
  3. Create Neg(DFA(r1))x DFA(r2) (which matches exactly those words matched by Neg(DFA(r1)) and DFA(r2))
  4. Check that the automaton made in 3. does not match any word

This works, since what you are checking is that there are no words matched by r2 that are not matched by r1.

Dag
  • 71
  • 2
2

Yes.

The problem of equivalence of two regular languages is decidable.

Sketch of an algorithm:

  • minimize both DFAs
  • check if they are isomorph
maxschlepzig
  • 35,645
  • 14
  • 145
  • 182
  • Graph isomorphism is not solvable in polynomial time, so I don't see how that helps. – Billy ONeal Sep 02 '10 at 19:15
  • @Billy: I guess your answer is that this is a theoretically solvable problem that isn't practical to solve. – szbalint Sep 02 '10 at 19:23
  • @szbalint: Well "theoretically" I could pose every possible input string to the languages and see if they match the same thing. If it's not solvable on reasonable consumer hardware, then there's little point. – Billy ONeal Sep 02 '10 at 19:27
  • 1
    @Billy ONeal: What have graph isomorphism to do with it? And don't bash theoretical cs. The decidable results are of practical interest. E.g. the problem of equivalence of two context free languages is not decidable. – maxschlepzig Sep 02 '10 at 19:47
  • I'm not "bashing theoretical CS" -- but I think it's obvious enough from my question that proving that the problem is merely decidable would not be all that useful of an answer. – Billy ONeal Sep 02 '10 at 20:00