9

I'm attempting to implement a text clustering algorithm. The algorithm clusters similar lines of raw text by replacing them with regexes, and aggregates the number of patterns matching each regex so as to provide a neat summary of the input text instead of showing repetitive patterns from the input text. In this attempt I ran into the need to find if one regex covers another.

Assume we are concerned only about regular expressions with '*' and '+' wild cards i.e. '*' meaning zero or more occurences of an alphabet, and '+' meaning 1 or more occurences of an alphabet. Also assume the character set to be ASCII.

For example:

1. AB covers AB
      This is straightforward.
2. ABC* covers ABC
      Because ABC* can generate: ABC, ABCC, ABCCC etc.
3. A*B+C* covers AB+C*
      Because A*B+C* can generate ABBC, AABBC, AABBCC etc. which covers
      all strings generated by AB+C*.
4. A+M+BC* covers AMM+B+C+M+BC*
      Similar to case [3] above.

Basically I'm looking for an efficient implementation of the following method which tells if strA (may contain a regex) covers for strB (may contain a regex). Note that there should also be a way to escape the regex characters '*' and '+' in the input strings strA and strB.

Method signature in C++:

bool isParentRegex(const string& strA, const string& strB)

My thinking is that a recursive approach is required for the implementation and it may be a bit complicated. But I'm curious to know if I could reuse existing implementations instead of re-inventing the wheel or if there are any other straightforward ways to do it.

Kowshik
  • 1,541
  • 3
  • 17
  • 25
  • possible duplicate of [Regex: Determine if two regular expressions could match for the same imput?](http://stackoverflow.com/questions/3410256/regex-determine-if-two-regular-expressions-could-match-for-the-same-imput) –  Mar 27 '12 at 11:35
  • Like most things, this would be easier in Perl. :-) – ruakh Mar 27 '12 at 13:31
  • 1
    @JarrodRoberson: No, that's about overlap (is there _any_ input) and this is a subset question. – MSalters Mar 28 '12 at 07:39

4 Answers4

4

Considering the simple regex grammar you propose, the solution is fairly trivial.

Take your more complex example, A+M+BC* covers AMM+B+C+M+BC* You can rewrite it as A{1,}M{1,}B{1,1}C{0,} covers A{1,1}M{2,}B{1,}C{1,}M{1,}B{1,1}C{0,}

This leads us to the simple rule: R1 covers R2 if all symbols appear in the same order, all lower bounds of R1 are less or equal to those of R2, and the upper bounds of R1 are more or equal than those of R2.

Now there's one slight problem with the simple rule. AB*C covers AC, i.e. there's a possibility that an optional symbol appears in R1 and not in R2. You can solve that by inserting a {0,0} in R2 when there's an (optional) symbol in R1 that doesn't appear in the equivalent position in R2. E.g. AB*C does cover AB{0,0}C.

The "optional symbol" rule is an optimization. If the symbol in R1 isn't optional, R1 certainly doesn't cover R2. E.g. AB+C doesn't cover AC. Therefore there's no need to insert a B{0,0}. But if you do, you'll see that A{1,1}B{1,}C{1,1} doesn't cover A{1,1}B{0,0}C{1,1} since the R1 lower bound on B (1) is more that the R2 lower bound on B (0)

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • 1
    I think this is trickier than you're letting on; consider the case that `R1` is `A{1,}B{0,}A{1,}` and `R2` is `A{2,}`. In order to insert `B{0,0}` in the right spot, you need to split `A{2,}` up, which requires having done some lookahead in `R1`. – ruakh Mar 27 '12 at 22:44
  • I hadn't considered `A{2,}` as input; the question stated `A*` and `A+` only. So your case could be `A+B*A+` versus `A+A+`, and then it's clear where you'd insert the `B{0,0}`. It could also have been `A+B*A+` versus `A*A+A+A*` (the latter is also equal to `A{2,}`) and then it's indeed not so clear. – MSalters Mar 28 '12 at 07:47
  • I didn't mean `A{2,}` as input, I meant it as the result of your rewrite. (Oh, or did you mean that the insertion of `B{0,0}` would happen *during* rewriting? I'd been picturing it as a second step.) But even in `A+B*A+` vs. `A+A+` it's not so clear where to insert the `B{0,0}` unless you do some sort of lookahead or backtracking to consider *both* the possibility that the first `A+` corresponds to `A+A+` *and* the possibility that it corresponds only to one `A+`. – ruakh Mar 28 '12 at 11:35
  • And... What if I have a ECMA-Regex with it's features? Your solution is not full. – Alex A. Jul 25 '21 at 12:03
2

I would do something like implementing a function for finding the minimal DFA from a given Regular Expression. Let's Assume that

DFA GetMinimalDFA(Regex r1) does that.

bool isParentRegex(Regex r1, Regex r2) {
    DFA a = GetMinimalDFA(r1);
    DFA b = GetMinimalDFA(Regex.OR(r1,r2))
    return a.Equals(b);
}
Stavros Zotalis
  • 718
  • 4
  • 24
  • The minimal DFA? I'm not sure that works: is the minimal DFA unique? And, computing the *minimal* DFA is not necessarily a simpler task than the original question. (In fact, I would guess that it is harder.) (EDIT: actually, [DFA minimisation](https://en.wikipedia.org/wiki/DFA_minimization) is a known problem, so maybe not) – huon Mar 27 '12 at 12:42
  • 1
    Typical Comp.Sci. answer, though: "I don't know the solution to this problem, but I can show it's just as hard as another problem" :P But yeah, the Wiki link shows that there's one unique minimal DFA. Still, `DFA.Equals(DFA)` is another hard function. – MSalters Mar 27 '12 at 13:30
  • @MSalters, oh, yeah, I should've read it rather than skimmed it. :) – huon Mar 28 '12 at 12:28
2

In Perl, this would be pretty simple. Step one is to normalize each regex by changing A+ to AA*, A*A to AA*, and A*A* to A*:

sub normalize_regex($)
{
    local $_ = shift;
    s/(.)\+/$1$1*/g;
    1 while s/(.)\*\1(?!\*)/$1$1*/g or s/(.\*)\1/$1/g;
    return $_;
}

Step two is to convert the first regex from a regex that matches the strings themselves, to a Perl-regex that matches the normalized regexes that match those strings; for example, AA*B would be converted to ^AA*\*?B$, meaning "start-of-string, followed by A, followed by zero or more A's, optionally followed by an asterisk, followed by B, followed by end-of-string":

sub regex_to_metaregex($)
{
    local $_ = shift;
    s/(.)(\*?)/$2 ? "\Q$1\E*(\Q$1\E\\*)?" : "\Q$1"/eg;
    return qr/^$_$/;
}

Step three needs no explanation:

sub does_regex1_cover_regex2($$)
{
    my ($r1, $r2) = @_;
    $r1 = regex_to_metaregex normalize_regex $r1;
    $r2 = normalize_regex $r2;
    return scalar $r2 =~ m/$r1/;
}

This returns a true value for your cases #1–3. It returns a false value for your case #4, however, because unless I'm really missing something, A+M+BC* does not cover AMM+B+C+M+BC*?

Note that there should also be a way to escape the regex characters '*' and '+' in the input strings strA and strB.

I did not worry about that in the above code, but since you're only worried about ASCII, a preprocessing step could handle \* meaning *, \+ meaning +, and \\ meaning \, by translating them into single characters outside the ASCII range:

sub process_escapes($)
{
    local $_ = shift;
    s/\\\\/\x80/g;
    s/\\\+/\x81/g;
    s/\\\*/\x82/g;
    s/\x80/\\/g;
    return $_;
}

(though that's obviously rather hackish).

In C++, you can use the same approach — there exist libraries that implement all the necessary features of Perl regexes — though obviously it'd be a bit more work.

ruakh
  • 175,680
  • 26
  • 273
  • 307
0

please check this perl module source but remember it would not work for all regexps (as it will result in solving the halting problem.

neutrinus
  • 1,879
  • 2
  • 16
  • 21
  • 1
    The Halting problem applies to Turing Machines: regular languages (which are described by regular expressions) are much weaker than TMs, so it is entirely feasible that there is a general solution to this problem. – huon Mar 27 '12 at 12:46
  • 1
    Perl "regex" isn't a regular language in the formal sense. See [this earlier question](http://stackoverflow.com/questions/7983115/are-perl-regexes-turing-complete). But yes, even those "non-regular regexes" are still weaker. – MSalters Mar 27 '12 at 13:35