
Regular expressions rapidly become too complex (for me) to understand. Even something as simple as `[ab][cd]` has several logical branches. My goal is to improve the maintainability of our code base, so answers to these questions could help us detect and fix complex code:

  1. Are there computational complexity metrics (similar to cyclomatic complexity) that include the complexity inherent in regular expressions?
  2. Are there any tools that produce a complexity number for regular expressions?
  3. Are there tools that can suggest simplifications of regular expressions?
kc2001
    Complexity metrics? Yes: whenever you see a pattern like `((x*)*)*` (nested quantifiers), it is complex. – revo Feb 14 '19 at 14:16

2 Answers


You might try taking the compiled form of the regexp and mapping some code-complexity metric onto it, such as lines of code or cyclomatic complexity. To see what I mean, look at the following Stack Overflow answer, which shows how Perl lets you access the compiled form of a regexp: https://stackoverflow.com/a/2348725/5747415. Another example is shown at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions; quoting the tool output from that page:

Compiling REx '[bc]d(ef*g)+h[ij]k$'
1: ANYOF[bc](12)
12: EXACT <d>(14)
14: CURLYX[0] {1,32767}(28)
16:   OPEN1(18)
18:     EXACT <e>(20)
20:     STAR(23)
21:       EXACT <f>(0)
23:     EXACT <g>(25)
25:   CLOSE1(27)
27:   WHILEM[1/1](0)
28: NOTHING(29)
29: EXACT <h>(31)
31: ANYOF[ij](42)
42: EXACT <k>(44)
44: EOL(45)
45: END(0)

By the way, congratulations on the decision to improve your code base's maintainability. That said, I have to express my doubts that any formal metric can provide better guidance than (or even come close to) the judgment of experienced developers...

Dirk Herrmann
    You're right that many software metrics are on the simplistic side. However, they are good for helping to identify potential trouble-spots, particularly in large code bases. – kc2001 Feb 15 '19 at 12:18

If you are dealing with a regular-expression system equivalent to formal regular expressions (which describe regular languages: no counting, no lookbehind, no matching of paired parentheses, etc.), or if you restrict yourself to those features (even though your regex system can describe non-regular languages), then there is a precise notion of complexity (or at least you could derive one), and there is a certain sense in which regular expressions can be "minimized".

By the Myhill-Nerode theorem, all regular languages have a finite number of equivalence classes under the indistinguishability relation on strings. These equivalence classes correspond directly to states in a minimal deterministic finite automaton for the regular language. You could take the number of states of a minimal deterministic finite automaton for the language to be the "fundamental" complexity of the language itself.
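To make the state-counting idea concrete, here is a minimal sketch (my own illustration, not part of the original answer) of Moore-style partition refinement in Python; the DFA encoding and function name are assumptions chosen for the example:

```python
def minimal_state_count(states, alphabet, delta, accepting):
    """Size of the minimal DFA equivalent to the given one.

    `delta[s][a]` is the successor of state `s` on symbol `a`.
    A sketch: assumes a complete DFA with only reachable states.
    """
    # Start by separating accepting from non-accepting states.
    block = {s: (s in accepting) for s in states}
    while True:
        # A state's signature: its block plus the blocks of its successors.
        sig = {s: (block[s],) + tuple(block[delta[s][a]] for a in alphabet)
               for s in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        new_block = {s: ids[sig[s]] for s in states}
        if len(set(new_block.values())) == len(set(block.values())):
            return len(set(new_block.values()))   # refinement has stabilized
        block = new_block

# Example: "strings over {a, b} that end in b", with one redundant state:
# states 1 and 2 are behaviorally identical, so the minimal DFA has 2 states.
states = [0, 1, 2]
delta = {0: {"a": 0, "b": 1},
         1: {"a": 0, "b": 2},
         2: {"a": 0, "b": 1}}
print(minimal_state_count(states, "ab", delta, accepting={1, 2}))  # 2
```

The final blocks are exactly the Myhill-Nerode equivalence classes restricted to reachable states, which is why their count is a language-level (not representation-level) measure.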

There are algorithms which can take you from a (formal) regular expression to a minimal deterministic finite automaton, and then back again to a regular expression. Doing this should give you a canonical regular expression for every regular language. I imagine (though I have not worked out a proof) that the process of producing a regular expression from a minimal deterministic finite automaton could be modified to produce the shortest (in terms of number of operations) regular expression possible.
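The automaton-to-expression direction can be sketched with the standard state-elimination (GNFA) construction; the following Python is my own illustration (names, string-based labels, and the example DFA are all assumptions), and it makes no attempt to simplify, so the output is correct but rarely minimal:

```python
import re
from itertools import product

def dfa_to_regex(states, alphabet, delta, start, accepting):
    """Build a regex for a DFA by state elimination (GNFA construction).

    A sketch: assumes a complete DFA that accepts at least one string.
    """
    def union(a, b):
        return b if a is None else (a if b is None else f"{a}|{b}")
    def group(r):
        return r if len(r) == 1 else f"({r})"
    edge = {}                           # edge[(p, q)] = regex label from p to q
    for p in states:
        for a in alphabet:
            edge[(p, delta[p][a])] = union(edge.get((p, delta[p][a])), a)
    S, F = "START", "END"               # fresh pseudo-states (assumed distinct)
    edge[(S, start)] = ""               # "" stands for the empty string
    for q in accepting:
        edge[(q, F)] = union(edge.get((q, F)), "")
    for k in list(states):              # eliminate the real states one by one
        loop = edge.pop((k, k), None)
        star = group(loop) + "*" if loop else ""
        ins = {p: r for (p, q), r in edge.items() if q == k}
        outs = {q: r for (p, q), r in edge.items() if p == k}
        for key in [e for e in edge if k in e]:
            del edge[key]
        for p, rin in ins.items():      # reroute every path through k
            for q, rout in outs.items():
                piece = ((group(rin) if rin else "") + star
                         + (group(rout) if rout else ""))
                edge[(p, q)] = union(edge.get((p, q)), piece)
    return edge[(S, F)]

# Example: strings over {a, b} that end in b.
delta = {0: {"a": 0, "b": 1}, 1: {"a": 0, "b": 1}}
rx = dfa_to_regex([0, 1], "ab", delta, start=0, accepting={1})
for n in range(5):                      # cross-check against the DFA directly
    for s in map("".join, product("ab", repeat=n)):
        q = 0
        for c in s:
            q = delta[q][c]
        assert bool(re.fullmatch(rx, s)) == (q == 1)
```

Running the two conversions in sequence (expression to minimal DFA, then back) is what yields a canonical form; the elimination order here is arbitrary, and choosing it well is exactly where a "shortest result" variant would do its work.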

The complexity of the language could be the number of operations in such a canonical regular expression. The actual complexity of any given regular expression could be the number of operations in it. The ratio could give you a sense of how "inefficient" or "unnecessarily complex" your regular expression is.

If you really need the non-regular features of regexes, then you are out of luck: there is no computable notion of minimization for those more powerful language classes. You can invent complexity metrics all day, but you'll never get a general algorithmic answer to "how inefficient is this compared to the baseline?" Another way to put it: making a cake may be harder than making popcorn, but if you need a cake, you have to expend the extra effort to get what you need.

Patrick87
  • Upon reflection, the regular expressions obtained from minimal deterministic finite automata will not be minimal; regular expressions correspond more closely to nondeterministic finite automata which can be smaller than corresponding minimal deterministic automata. This doesn't change much, except that it becomes possible for your ratio to be less than one (if your regular expression relies on nondeterminism to describe languages more concisely than is possible to do purely deterministically). Otherwise, the process works identically. – Patrick87 Feb 19 '19 at 00:14