30

It is possible to write a regex that in some cases needs exponential running time. One such example is (aa|aa)*. If the input consists of an odd number of a's, matching takes exponential time.

It is easy to test this. If the input contains only a's and has length 51, the regex needs a few seconds to match (on my machine), whereas with an input of length 52 the matching time is not noticeable (I tested this with the built-in regex engine of the Java runtime).
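A minimal way to reproduce this (a sketch, not my exact test code; the numbers obviously depend on the machine and JDK version):

```java
import java.util.regex.Pattern;

public class RegexTiming {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(aa|aa)*");
        for (int n : new int[] {50, 51, 52}) {
            String input = "a".repeat(n);        // Java 11+: a string of n 'a's
            long start = System.nanoTime();
            boolean matched = p.matcher(input).matches();
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("length %d: matched=%b in %d ms%n", n, matched, millis);
        }
    }
}
```

If the engine backtracks, each additional pair of a's in front of the odd leftover should roughly double the time.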

I have written a regex parser to find the reason for this behavior, but I couldn't find it. My parser can build an AST or an NFA from a regex. After that it can translate the NFA to a DFA, using the powerset construction algorithm.
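For reference, the core of the powerset construction looks roughly like this (a simplified, self-contained sketch over a hand-written epsilon-free NFA, not my actual implementation; an NFA built with the Thompson construction additionally needs an epsilon-closure step):

```java
import java.util.*;

// Minimal sketch of the powerset (subset) construction for an epsilon-free NFA.
// NFA states are plain integers; a DFA state is the set of NFA states it contains.
final class SubsetConstruction {

    static final class Nfa {
        int start;
        Set<Integer> accepting = new HashSet<>();
        Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();

        void addTransition(int from, char symbol, int to) {
            delta.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(symbol, k -> new HashSet<>())
                 .add(to);
        }
    }

    static final class Dfa {
        Set<Integer> start;
        Set<Set<Integer>> accepting = new HashSet<>();
        Map<Set<Integer>, Map<Character, Set<Integer>>> delta = new HashMap<>();
    }

    static Dfa determinize(Nfa nfa, Set<Character> alphabet) {
        Dfa dfa = new Dfa();
        dfa.start = Set.of(nfa.start);
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Set<Integer>> seen = new HashSet<>();
        work.push(dfa.start);
        seen.add(dfa.start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            if (current.stream().anyMatch(nfa.accepting::contains)) {
                dfa.accepting.add(current);        // accepts if any contained NFA state accepts
            }
            for (char c : alphabet) {
                Set<Integer> next = new HashSet<>();
                for (int s : current) {            // union of all NFA moves on c
                    next.addAll(nfa.delta.getOrDefault(s, Map.of())
                                         .getOrDefault(c, Set.of()));
                }
                dfa.delta.computeIfAbsent(current, k -> new HashMap<>()).put(c, next);
                if (seen.add(next)) {
                    work.push(next);               // newly discovered DFA state
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        // Hand-written, already epsilon-free NFA for (aa|aa)*: state 0 is the start and is
        // accepting; each of the two identical branches reads 'a','a' and loops back to 0.
        Nfa nfa = new Nfa();
        nfa.start = 0;
        nfa.accepting.add(0);
        nfa.addTransition(0, 'a', 1);
        nfa.addTransition(1, 'a', 0);
        nfa.addTransition(0, 'a', 2);
        nfa.addTransition(2, 'a', 0);

        Dfa dfa = determinize(nfa, Set.of('a'));
        // The two branches collapse: the DFA only distinguishes "even" {0} and "odd" {1, 2}.
        System.out.println("DFA states: " + dfa.delta.keySet());
    }
}
```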

When I parse the regex mentioned above, the parser creates an NFA with 7 states; after conversion, only 3 states are left in the DFA. The DFA represents the more sensible regex (aa)*, which can be matched very fast.
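Conceptually, matching with that DFA is just an even/odd toggle over 'a' (presumably the three states are an accepting even state, an odd state, and a dead state for any other character), so it runs in a single linear pass:

```java
// Hand-rolled run of the DFA for (aa)* / (aa|aa)*:
// EVEN (accepting, start), ODD, and an implicit DEAD state for non-'a' input.
static boolean matchesEvenAs(String input) {
    boolean even = true;                      // start in EVEN, which is accepting
    for (int i = 0; i < input.length(); i++) {
        if (input.charAt(i) != 'a') {
            return false;                     // DEAD state: can never accept again
        }
        even = !even;                         // toggle EVEN <-> ODD on every 'a'
    }
    return even;                              // accept only if we end in EVEN
}
```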

Thus, I don't understand why there are regex implementations that can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If so, why not? And what is the technical reason they are so slow?

kiritsuku
  • http://stackoverflow.com/questions/844183/python-regular-expression-implementation-details Looks like there has been some discussion of this in the past – lucemia Jan 18 '12 at 18:59

2 Answers

22

Russ Cox has a very detailed article about why this is so and about the history of regexes (part 2, part 3).

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

Largely, it comes down to the proliferation of non-regular features in "regular" expressions, such as backreferences, and to the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).
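To make the quoted claim concrete, here is a minimal sketch (not any particular engine's actual code) of how a naive recursive backtracking matcher handles (aa|aa)* anchored to the whole input. On a string of n a's with n odd, no path can succeed, so it ends up exploring roughly 2^(n/2) paths:

```java
// Naive backtracking matcher specialized to the pattern (aa|aa)*:
// at each position, try the first alternative "aa"; if the rest of the match
// eventually fails, backtrack and try the second (identical) alternative.
static boolean backtrack(String input, int pos) {
    if (pos == input.length()) {
        return true;                                   // whole input consumed: the star matches
    }
    if (input.startsWith("aa", pos) && backtrack(input, pos + 2)) {
        return true;                                   // first alternative succeeded
    }
    if (input.startsWith("aa", pos) && backtrack(input, pos + 2)) {
        return true;                                   // second alternative retried after failure
    }
    return false;                                      // both alternatives failed at this position
}
```

An automata-based matcher never re-explores the same position this way; it tracks the set of reachable states in one left-to-right pass.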

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.

  • This is not the answer I expected – a reason like "the implementation is based on inefficient algorithms" is not very satisfying. However, thank you very much for this great post. – kiritsuku Jan 17 '12 at 17:07
  • Is this answer suggesting that most regex engines are **not** based on the fast-and-simple finite-state automata taught in every Theory of Computation course ever? And that it's because most people didn't know about them in the '80s/'90s? I find that *very* hard to believe – hasn't Theory of Computation been a part of every Computer Science curriculum since the dawn of time? – BlueRaja - Danny Pflughoeft Jan 17 '12 at 20:37
  • @BlueRaja: Are you denying that most (by market share) regular expression implementations use backtracking? If you don't trust my excerpts you could, like, read the articles. Or the source code of most of the projects. Or do some benchmarks yourself, like Antoras did. – Jan 17 '12 at 21:18
  • @Joe: No, sorry, I don't know much about regex implementations. But it sounds above like you're saying backtracking is completely inferior to FSAs, and that they all use this inferior method because no one knew about the superior method, FSAs, which are taught in every Computer Science curriculum ever. I'm saying I find that very hard to believe. If they all use backtracking, there *must* be a reason to prefer it. – BlueRaja - Danny Pflughoeft Jan 17 '12 at 21:22
  • @BlueRaja: Backtracking is easier to implement. It also admits non-regular features (but as I said below, that's a cop-out, because checking for non-regular features can be done regularly). Really, that's pretty much it. You vastly overestimate how much people remember from school, how clever they are in practice even if they remember that FSAs recognize regular languages, and how many people are ever going to be called on to actually pick a regex library rather than use the one already given to them by the language. (Or seriously, just read the article. Why doesn't anyone read links anymore?) – Jan 17 '12 at 21:37
  • If one is given a single regex string and a single string against which to match it, the worst-case time of a backtracking solution may be very bad, but in many cases the execution time of a backtracking solution may be better than the time required to build an FSA for the regex. If the regex won't be reused, the optimal approach may be to start with a backtracking solution, but limit the number of steps it performs before giving up and switching to an FSA. – supercat Sep 22 '14 at 22:07
  • When I took a computation theory class, I was so interested in the use of finite automata for regular expression matching that I wrote a simple, minimal regex engine. I found that it was simple to generate an NFA, but the conversion to a DFA had the potential to explode into a huge state space, particularly for an expression with a lot of union operators; while recognition was fast, compiling the regex was very slow. – TallChuck Nov 16 '18 at 15:05
  • Maybe related: https://www.regular-expressions.info/catastrophic.html – Peter Franek Aug 21 '20 at 22:10
2

Regular expressions conforming to the formal definition can be matched in linear time, because they have corresponding finite automata. They are built only from parentheses, alternation | (sometimes called sum), the Kleene star *, and concatenation.

Extending regular expressions with features such as backreferences makes the matching problem NP-complete in general. Here you can find an example of a regular expression recognizing non-prime numbers.
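The example referred to is presumably the well-known backreference regex that matches unary representations of non-prime numbers; a quick sketch of trying it with Java's engine (which supports backreferences):

```java
import java.util.regex.Pattern;

public class NonPrime {
    // Matches a string of n '1' characters iff n is NOT prime:
    //  - "1?"          covers n = 0 and n = 1 (not prime)
    //  - "(11+?)\\1+"  covers composites: a block of m >= 2 ones repeated k >= 2 times
    private static final Pattern NON_PRIME = Pattern.compile("1?|(11+?)\\1+");

    public static void main(String[] args) {
        for (int n = 0; n <= 12; n++) {
            boolean nonPrime = NON_PRIME.matcher("1".repeat(n)).matches();  // Java 11+
            System.out.println(n + (nonPrime ? " is not prime" : " is prime"));
        }
    }
}
```

No DFA can do this; the language of unary non-primes is not regular, which is why backreferences force a fundamentally different (and potentially much slower) matching strategy.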

I suspect that such an extended implementation can have non-linear matching time even in simple cases.

I made a quick experiment in Perl, and your regular expression is matched equally fast for odd and even numbers of 'a's.

Rafał Rawicki
  • One can construct REs that trigger pathological performance cases in the commonly-used libraries without resorting to non-regular features like backreferences, e.g. `(.*) (.*) (.*) (.*) (.*)`. See the article in my answer. – Jan 16 '12 at 23:41
  • "such an extended implementation can have non-linear matching time even in simple cases." Only if the implementer isn't paying attention. Checking whether an RE contains backreferences requires only linear time. At that point you can decide to punt it to the regular expression matcher, or to the non-regular "regular expression" matcher. – Jan 16 '12 at 23:43
  • Regarding your benchmark: for years now Perl has memoized its backtracking, leading to linear time for such matches at the cost of memory (and at some size it switches back to exponential CPU). I suspect that is what you're seeing. (Or maybe it finally fixed its parser? It's very possible I missed the news.) – Jan 16 '12 at 23:53
  • Well, it is wrong to state that it is NP-complete to recognize prime numbers. Search for "PRIMES is in P". It is even more wrong after I actually read the article: given N, PRIMES being in P means it takes time polynomial in log(N) to check whether N is prime or not, while the article you mentioned takes O(N) to check even if the RE ran in linear time. You should study computation theory again. – Kan Li Aug 20 '13 at 20:52