8

Is it possible to check if a given regular expression will match any string? Specifically, I'm looking for a function matchesEverything($regex) that returns true iff $regex will match any string.

I suppose that this is equivalent to asking, "given a regex r, does there exist a string that doesn't match r?" and I don't think this is solvable without placing bounds on the set of "all strings". I.e., if I assume the strings will never contain "blahblah", then I can simply check if r matches "blahblah". But what if there are no such bounds? I'm wondering if this problem can be solved checking if the regex r is equivalent to .*.

Christopher Neylan
  • 8,018
  • 3
  • 38
  • 51
  • 4
    I believe this is equivalent to the [Halting Problem](http://en.wikipedia.org/wiki/Halting_problem). It may not be possible to write an algorithm to determine if an arbitrary regex is equivalent to `.*` – Jim Garrison Jul 30 '13 at 18:30
  • Regexes with lookarounds and backrefs, but no code interpolation, should be a subset of or equal to context sensitive grammars. Deciding these languages is not Turing complete, therefore this question shouldn't be equivalent to the halting problem. *If*, given a CSG, we can produce a string of this language by substituting the rules, then we may apply a forbidden substitution, thus producing a string that is not in our language. Sadly I don't know whether this is the case, and I wouldn't be able to sketch a proof of this. – amon Jul 30 '13 at 19:47
  • 2
    This is called the "Emptiness Problem", and is decidable for DFAs/NFAs (i.e. regexes without backreferences/lookarounds) http://www.cs.miami.edu/~ogihara/csc527/new04-1.pdf For regexes with backrefs (context sensitive grammars), the emptiness problem is undecidable. (I can't find a proof right now, but it's frequently mentioned in the literature) – rmmh Jul 30 '13 at 20:05

1 Answers1

13

This doesn't exactly answer your question, but hopefully explains a little why a simple answer is hard to come by:

First, the term 'regex' is a bit murky, so just to clarify, we have:

  • "Strict" regular expressions, which are equivalent to deterministic finite automatons (DFAs).
  • Perl-compatible regular expressions (PCREs), which add a bunch of bells and whistles such as lookaheads, backreferences, etc. These are implemented in other languages too, such as Python and Java.
  • Actual Perl regular expressions, which can get even more crazy, including arbitrary Perl code, via the ?{...} construct.

I think this problem is solvable for strict regular expressions. You just construct the corresponding DFA and search that graph to see if there's any path to a non-accept state. But that doesn't help for 'real world' regex, which is usually PCRE.

I don't think PCRE is Turing-complete (though I don't know - see this question, too: Are Perl regexes turing complete?). If it were, then I think as Jim Garrison commented, this is basically the halting problem. That said, it's not easy to transform them into a DFA, either, making the above method useless...

I don't have an answer for PCREs, but be aware that the aforementioned constructs (backreferences, etc) would make it pretty hard, I imagine. Though I hesitate to say "impossible."

A genuine Perl regex with ?{...} in it is definitely Turing-complete, so there be dragons, and I think you're out of luck.

Community
  • 1
  • 1
jwd
  • 10,837
  • 3
  • 43
  • 67
  • great reply. you've reinforced what i was thinking. for the use case that i'm addressing, any actual perl regular expressions would be what i care about. pretty much anything where `eval{ 'xx' =~ m/$regex/i; }` results in a successful eval. – Christopher Neylan Jul 30 '13 at 19:48