Algorithmic complexity of Regular Languages in Extended Regular Language frameworks

Question

I have a bit of a background in Formal Languages and recently I found out that Java and other languages use what is coined Extended Regular Languages. Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background. As a result, I had always assumed no matter how ugly my regex, no matter how long my regex, Pattern.matches or similar methods would run in linear time. But that assumption seems to be incorrect.

A post I read seems to suggest some Regex expressions do run in linear time, but I don't entirely believe or trust one single person.

I'll eventually write my own Java Formal Regular Expression library (existing ones I've found only have GNU GPL licences) but in the meantime I have a few questions on the time complexity of Java/C# regexs. Want to ensure what I read elsewhere is correct.

Questions:

A formal language like \sRT\s., would the match method of a regexes in Java/C# solve membership in linear or non-linear time?
In general, how would I know if a given Regular Language's expression membership problem is linear time for a Regex?

I do text analysis, finding out Java regexes aren't DFA's was really a downer.

I learned about DFAs a long time ago, and my recollection is that they could tell you whether a pattern matched, but they couldn't return the additional information needed for capture groups which are _very_ important. Capture groups are the reason regexes have "greedy" and "reluctant" qualifiers (i.e. `.*` vs. `.*?`); those would be irrelevant if the only purpose of a match was to return true/false, and they'd be irrelevant to a DFA. That was decades ago, however, and I don't know how much DFA theory has evolved since. — ajb, Dec 11 '13 at 20:15
@ajb You may have been dealing with simplified DFAs. For example, Mealy Automata can be used to output the equivalent of capture groups. I believe the only difference between the power of Extended Regular Expressions and Formal Regular Languages is backreferences. — Lan, Dec 11 '13 at 21:25
You might be interested in [this series of articles](http://swtch.com/~rsc/regexp/). — Martin Ender, Dec 16 '13 at 19:06

score 6 · Answer 1 · answered Dec 11 '13 at 22:13

Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background.

This belief is common in academics. In practice, regular expression compilation does not produce a DFA and then execute it. I have only a small amount of experience in this; I worked briefly on the regular expression compilation system in Microsoft's implementation of JavaScript in the 1990s. We chose to compile the "regular" expression into a simple domain-specific bytecode language and then built an interpreter for that language.

As you note, this can lead to situations where repeated backtracking has exponentially bad time behavior in the length of the input, but the construction of the compiled state is essentially linear in the size of the expression.

So let me counter your questions with another two questions -- and I note that these are genuine questions, not rhetorical.

1) Every actually regular expression corresponds to an NDFA with let's say n states. The corresponding DFA might require superposition of up to 2ⁿ states. So what stops the time taken to construct the DFA from being exponential in pathological cases? The runtime might be linear in the input, but if the runtime is exponential in the pattern size then basically you're just trading one nonlinearity for another.

2) So-called "regular" expressions these days are nothing of the sort; they can do parenthesis matching. They correspond to pushdown automata, not nondeterministic finite automata. Is there a linear algorithm for constructing the corresponding pushdown automaton for a "regular" expression?

Hello, thank-you for the questions. For 1) 'usually' your regular expression is smaller than the strings you are matching. So just going by the big-O, construction an DFA to use for one match may be faster than Java's Regex. Even if the state count in the NFA is larger than input strings, the O(2^n) construction time is dwarfed with a God's Constant when you start throwing millions of strings at the DFA every day. But let's say we stick with an NFA with n states, the membership problem is only O(nm) where m is the size of the input string. — Lan, Dec 12 '13 at 03:25
2) This comes from a bit of recent results. Both directions of CFG <-> Pushdown can be done in linear time & space. Extended Regular Expressions are not standardized. So for Regexes that are CFG, you can construct a PDA in linear time (assuming Regex -> CFG only expands the input size by a constant). But here is the thing, not all programming languages have Regexes that are CFG!!!! So for them, you can't translate them to PDA's period. Ruby is a little devil: http://stackoverflow.com/a/8964806/1541137 — Lan, Dec 12 '13 at 03:46

score 2 · Answer 2 · answered Dec 11 '13 at 19:55

2

Regular expressions are implemented in many languages as NFAs to support backtracking (see http://msdn.microsoft.com/en-us/library/e347654k(v=vs.110).aspx). Because of backtracking, you can construct regular expressions which have terrible performance on some strings (see http://www.regular-expressions.info/catastrophic.html)

As far as analysing regular expressions to determine their performance, I doubt there is a very good way in general. You can look for warning flags, like compounded backtracking in the second link's example, but even that may be hard to detect properly in some cases.

answered Dec 11 '13 at 19:55

Sean

4,450
25
22

"Regular expressions are implemented [...] to support backtracking". Backtracking isn't really a feature, it's an implementation detail. Are you referring to back-referencing? – Kobi Dec 17 '13 at 13:50
@Kobi I think what he means is that back-references mean they need to backtrack. What surprises me is that when there are no back-references in a regex Java, C#, and many other languages still uses a backtracking approach (in a true NFA without backreferences you don't need backtracking). – Lan Dec 20 '13 at 19:58

score 1 · Accepted Answer · answered Dec 20 '13 at 00:02

1

Java's implementation of regular expression uses the NFA approach. Here is a link that explains it clearly.

Basically, a badly written but still correct regular expression can cause the engine to perform badly. For example, given the expression (a+a+)+b and the string aaaaaaaaaaaaaaa. It can take a while (depending on your machine, from several seconds to minutes) to figure out that there is no match.

The worst case performance of NFA is an almost match situation (given example). Because the expression forces the engine to explore every path (with lots of backtracking) to determine a non-match.

answered Dec 20 '13 at 00:02

Christopher Z

899
1
12
32

The article uses the wrong term, which it explains by "NFA engines were given more power, making them superior to DFA engines in computation power." Membership for NFA is still a linear problem, so "exponential" time membership implies it isn't a true NFA. It still provides support for http://stackoverflow.com/questions/4378455/what-is-the-complexity-of-regular-expression/4378549#4378549 as I requested in my original answer. Thank-you very much. – Lan Dec 20 '13 at 18:37

Algorithmic complexity of Regular Languages in Extended Regular Language frameworks

3 Answers3