8

Is there a way to test if a regular expression "contains" another regular expression?
For example:

RegEX1 = "a.*b";
RegEx2 = "a1.*b";

RegEX1 "contains" RegEX2.

As far as I know - this can't be done, am I wrong?


OK, joel.neely has shown that it can be done (haven't read it yet...) academically.

Can it be done in a programming language, say C# ?
How effective will that be? How long will it take to test 1000 pairs?
Dror
  • 7,255
  • 3
  • 38
  • 44

2 Answers2

6

Yes.

This paper contains a detailed discussion of the topic (see section 4.4).

joel.neely
  • 30,725
  • 9
  • 56
  • 64
  • 2
    Can you clarify your "yes". I think you are saying "Yes, you are wrong" and citing the paper that shows how it can be done (from a quick look at the paper). But it would be worth spelling that out explicitly. – Jonathan Leffler Jan 06 '09 at 13:28
  • 1
    The paper mentioned only says "It is a well known result that for two regular expressions B and R, it is readily decidable whether B subsumes R" and then goes on to describe "content models." Also, the paper's method appears to be simply enumerating all strings with length < n (calculated somehow?) and checking whether they are in the second expression but not the first. Decidable, perhaps, but not exactly feasible with 26^n options even without considering case and punctuation. – Clueless Feb 24 '10 at 06:25
0

Converting the two expressions to the equivalent state machines, and checking all paths in both machines allow the same matches, should do the trick. The pumping lemme should obviously be minded, so avoid revisiting old nodes.

It would only work for "simple" regular expressions (or real, what have you, perls recursive expressions are much more expressive).

While a graph of the state machine could have a large number of paths, it should still be limited (esp if the source for the expressions are human). So you'd find all the allowable paths of RegEX1, and check, each in turn, if it's allowable in RegEX2. If all paths are valid, you'd know that the one is contained in the other.

Svend
  • 7,916
  • 3
  • 30
  • 45
  • Is it possible (in a reasonable time) to run a test to get a hierarchy of regular expression (several hundreds of them)? can you provide pointers to code that does that? – Dror Jun 10 '09 at 13:56
  • I don't see why it would not possible, and in decent time as well. You'd have to build this from scratch though, which is not trivial. – Svend Jun 10 '09 at 15:00
  • checking "all paths are valid" for all pairs would probably take a very long time. "checking all paths in both machines" as you say may be infinite, or am i missing something? – Dror Jun 11 '09 at 05:30
  • 1. You shouldn't need to revisit nodes due to the pumping lemma 2. Not all pairs need to be checked. You find the paths in one, and see if it's allowable in the other. A large class of paths will obviously share parts in common, those need only be checked once, etc. – Svend Jun 11 '09 at 09:34