16

Given a regular expression R that describes a regular language (no fancy backreferences). Is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible as Wikipedia says:

The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L

For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?


As DPenner has already pointed out in the comments, the inverse of a regular expresion can be exponentially larger than the original expression. This makes inversing regular expressions unsuitable to implement negative partial expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?

Community
  • 1
  • 1
fuz
  • 88,405
  • 25
  • 200
  • 352
  • 7
    Construct the NFA from regex, construct the DFA from NFA (there should be an algo for this), turn the non-terminal states into terminal states and vice-versa, then derive the regex from DFA. – nhahtdh Mar 11 '13 at 12:01
  • @nhahtdh That sounds like an interesting idea; the problem is, that a NFA of length O(n) may correspond to a DFA of length O(2^n). – fuz Mar 11 '13 at 14:11
  • I didn't go through proper study on this topic, but I can't see how an NFA would generate a bigger DFA. Can you show me one such case? – nhahtdh Mar 11 '13 at 16:54
  • 1
    @nhahtdh Each state of the DFA associated with an NFA is a reachable member of the powerset of the NFA. This makes an upper bound of 2^n states where n is the number of states of the NFA. It is possible to reach this by constructing a regular language where each subset of NFA states can be reached. There was a simple example somewhere but I just can't find it. – fuz Mar 11 '13 at 17:43
  • 8
    See the relevant question on [theoretical CS stack overflow](http://cstheory.stackexchange.com/questions/5125/regular-expressions-finding-negation-of-regular-expression). NFA->DFA->inversion->regex is worst-case optimal. – thiton Mar 11 '13 at 19:06
  • Your Wikipedia quote doesn't match up with your description of the problem. The reverse of a language is the set of all strings from that language reversed. i.e. if L = {ab, abc, aba}, the reverse, L^R = {ba, cba, aba}. Regardless though, its still true that the negation (or more technically the complement) is still regular. – DPenner1 Mar 11 '13 at 19:06
  • @DPenner I am sorry. I may have misread that article. – fuz Mar 11 '13 at 19:10
  • @FUZxxl That's alright, formal languages can be confusing! As you rightly point out though, the standard algorithm for negating a regular expression is O(2^n) in space, and would be impractical for languages with a large amount of states in its NFA. – DPenner1 Mar 11 '13 at 19:16
  • 1
    This is certainly an interesting question, but is this really a programming question? Wouldn't this be a better fit for [ComputerScience](http://cs.stackexchange.com/?as=1)? (Especially since you seem uninterested in working, if not mathematically pure, solutions?) – JDB Mar 11 '13 at 19:41
  • 1
    @Cyborgx37 Let me quote the FAQ: *We feel the best Stack Overflow questions have a bit of source code in them, but if your question generally covers … • a specific programming problem • a software algorithm* – fuz Mar 11 '13 at 19:43
  • 1
    It all boils down to a programming question. You can do a lot of funny things with plain (aka really regular) regular expresion that you can't do with "enhanced" regex, such as letting an untrusted user use them to search stuff, since they have a guaranteed linear runtime. All toolkits for plain regex I know don't support inverse matching, even though the could be converted to plain regex. It would be nice to know whether there is an algorithm that allows to implement a "match inverse" operator without breaking the linear time complexity. – fuz Mar 11 '13 at 19:46
  • @Cyborgx37 “… programming **related** …”. ’nuff said. – Konrad Rudolph Mar 11 '13 at 19:48
  • 1
    @FUZxxl: For your second part, in practice, it will usually involve checking whether there is a match, then logically negate the result, or use negative look-ahead construct in the (ir)regular expression. – nhahtdh Mar 11 '13 at 19:49
  • @nhahtdh Look ahead is not part of plain regex. It might break linear complexity. – fuz Mar 11 '13 at 19:50
  • 1
    @FUZxxl: Consider an extension of plain regex, where you can have plain regex, OR negative look-ahead assertion which can only contain plain regex inside (not even another assertion inside). I think the complexity will be the same. The complexity is only broken when you have a wild mix of plain regex and look-ahead. – nhahtdh Mar 11 '13 at 19:56
  • @FUZxxl - Ok, I buy it. Ponder, though, that "you can't prove a negative". In most branches of science, it is most often the case that a question must be stated as a positive and then the answer given in terms of evidence to support it (or lack thereof). I believe you'll find the same is true with all but the most trivial regex. It is much easier to express what you are looking for, then negate the result. – JDB Mar 11 '13 at 19:57
  • @Cyborgx37 As nhahtdh wrote in the very first comment, it is indeed easily (in the sense that the algorithm is simple) possible to compute the inverse of a regex. It's just not easy (in the sense that it has a high computational complexity). – fuz Mar 11 '13 at 20:00
  • 1
    @nhahtdh Wouldn't that requite backtracking? The key to a linear time implementation is to avoid backtracking as hell. – fuz Mar 11 '13 at 20:02
  • The problem with my extension is that, while it is useful to negate some plain regex, it breaks the property where 2 regex can be freely concatenated, OR'ed together - which I think is the motivation for finding the plain regex which is the negation of another plain regex. – nhahtdh Mar 11 '13 at 20:03
  • @FUZxxl: Assuming that you can match a string with linear complexity (or confirm that there is no match) with a plain regex, the negative look-ahead simply negate the result of matching. (If the assumption is wrong, then even plain regex will have trouble checking that there is no match). – nhahtdh Mar 11 '13 at 20:05
  • @nhahtdh I can't really imagine how this should work. Could ou post a more detailed sketch of the algorithm in an answer? – fuz Mar 11 '13 at 20:14
  • 1
    @FUZxxl: It is not an algorithm or anything. Simply put, it is a form of `checking whether there is a match, then logically negate the result` (has match --> false, no match --> true). – nhahtdh Mar 11 '13 at 20:19
  • @nhahtdh This seems to fail when the negated expression is part of a larger expression. – fuz Mar 11 '13 at 20:38
  • @FUZxxl: Of course it will fail (check my "formal" definition of the extension, and the caveat a few comments later) – nhahtdh Mar 11 '13 at 20:47
  • Yes [Complement of a regular language is regular](http://stackoverflow.com/questions/14802732/finding-the-complement-of-a-dfa?lq=1) and there is an algorithmic way possible to write a complement regular expression for a regular expression. Additionally as DPenner says about inefficiency. We can always *good* RE according to minimized DFA. So for an RE there we can evaluate RE that is capable to accept complement language. – Grijesh Chauhan Mar 15 '13 at 16:17
  • What do you mean: »We can always good RE according to minimized DFA« – fuz Mar 16 '13 at 08:27
  • @FUZxxl no! that why I use word good instead of minimized. – Grijesh Chauhan Mar 17 '13 at 10:34
  • @GrijeshChauhan I still don't really get what you want to tell me. I am sorry. I have problems parsing your grammar. – fuz Mar 17 '13 at 10:35
  • @FUZxxl OK, `(1)` Complement of Regular Language(RE) is regular `(2)` to find RE of complement language. do like `RE-->NFA-->DFA-->minimized DFA --> COMPLEMENT DFA --> RE`. (a) `RE-->NFA` is algorithmic LEX parser (b) `NFA --> DFA` is algorithm (c) `DFA --> MINIMIZED DFA` again algorithmic (*both `b` and `c` school time assignments*) (d) `DFA --> COMPLEMENT DFA` is algorithmic (*by reading my answer I lined you can realized*) (e) complement `DFA-->RE` is algorithm using ARDEN'S THEOREM. So What I mean to say for a given RE you can find RE for complement. – Grijesh Chauhan Mar 17 '13 at 10:46
  • @FUZxxl because for complement DFA we need to make complement DFA (not partial) That is reason we gets complicated RE for complement language. But We can do some reduction work by MINIMIZE complement DFA – Grijesh Chauhan Mar 17 '13 at 10:50
  • @Grijesh Thanks for summing up what all the others said before. The problem is that the DFA's size might be exponential to the size of the NFA, causing the complement regex to have a size exponential to the original regex. – fuz Mar 17 '13 at 11:14
  • @FUZxxl Yes but you can always minimized NFA into DFA. Also What I means *complete DFA* always comes with more states. – Grijesh Chauhan Mar 17 '13 at 12:31
  • @GrijeshChauhan Tell me please how this reduces the length of the generated regular expression or the intermediate DFA. – fuz Mar 17 '13 at 13:22

2 Answers2

4

Unfortunately, the answer given by nhahdtdh in the comments is as good as we can do (so far). Whether a given regular expression generates all strings is PSPACE-complete. Since all problems in NP are in PSPACE-complete, an efficient solution to the universality problem would imply that P=NP.

If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.

  1. Use your efficient algorithm to generate a regular expression for the negation;
  2. Determine whether the resulting regular expression generates the empty set.

Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:

  1. The regular expression {} generates the empty set.
  2. (r + s) generates the empty set iff both r and s generate the empty set.
  3. (rs) generates the empty set iff either r or s generates the empty set.
  4. Nothing else generates the empty set.

Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.

(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output length is more than polynomially faster than the input length. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input).

Patrick87
  • 27,682
  • 3
  • 38
  • 73
  • `Whether a given regular expression generates all strings` - The problem description is quite vague here. I don't know what you are talking about. – nhahtdh Mar 11 '13 at 19:39
  • Thank you for the answer. Could you, if you like, also investigate into the second part of my question? – fuz Mar 11 '13 at 19:41
  • "Whether a given regular expression generates all strings is PSPACE-complete" I'm looking for a proof of that claim, do you have one? – a3nm Jun 01 '19 at 07:45
1

Wikipedia says: ... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. We can deduct from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.

Again, (as also @nhahtdh tried to explain) the simplest algorithm to address this question is to extend the scope of evaluation outside the context of the regular expression language itself. That is: match the strings you want to exclude (which represent a finite subset to work with) by using the original regular expression and then treat any failure to match as an actual match (out of an infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.

Alex Filipovici
  • 31,789
  • 6
  • 54
  • 78
  • `then compare it with the finite set of candidates you provide` Not really. Just try to match the string with the (plain) regex. If fail to match = a match. This only works if you want to particularly negate a whole plain regex. It cannot concatenate. – nhahtdh Mar 11 '13 at 20:51
  • @nhahtdh, I tried to reformulate the answer. Anywho, I don't see a real application of the question. – Alex Filipovici Mar 11 '13 at 21:07
  • Check [this comment](http://stackoverflow.com/questions/15337458/is-there-a-way-to-negate-a-regular-expression/15348226#comment21678409_15337458) – nhahtdh Mar 11 '13 at 21:09