It's pretty straightforward for a proper regex using just the operators ?
, *
and +
and |
, plus parentheses, character classes and of course ordinary characters. In fact even \1
-style backreferences (which aren't part of the formal definition of a regex, and do complicate some questions about regexes) can be handled without problems.
A regex is just a compact encoding of a tree structure (similar to how mathematical formulas are compact encodings of tree structures describing arithmetic). Between every adjacent pair of characters there is an implicit "follows" operator that corresponds to a node with 2 children, one being the subregex just to its left, and the other being the entire rest of the regex; a sequence of subregexes separated by |
characters corresponds to a single "alt" node with as many children as there are alternatives (i.e., one more than the number of |
characters), while every other operator has just a single child (namely the subregex it operates on), and every ordinary character or character class has no children at all. To calculate the maximum-length matching string, you can just do a bottom-up traversal of this tree structure, at each node greedily assigning the length of the longest string that would match that node, given knowledge of the longest strings that would match its children.
The rules for deciding the length of the longest string that matches any given node in this tree are:
- follows(x, y) (
xy
): maxlen(x) + maxlen(y)
- alt(a, b, ..., z) (
a|b|...|z
): max(maxlen(a), maxlen(b), ..., maxlen(z))
- maybe(x) (
x?
): maxlen(x)
- rep(x) (
x*
) or posrep(x) (x+
): infinity
- Any other single character or character class (
[...]
): 1
\1
-style backreferences: the maxlen for the corresponding parenthesised expression
One consequence is that the presence of *
or +
anywhere (unless escaped or part of a character class, obviously) will cause infinity to propagate up the tree until it hits the root.
Examples
Regex: abcd
"Function call syntax": follows(a, follows(b, follows(c, d)))
As a tree:
follows
/ \
a follows
/ \
b follows
/ \
c d
A second example:
Regex: (a|b|de)c?
"Function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c))
As a tree:
follows
/ \
alt maybe
/ | \ \
a b follows c
/ \
d e
For this second regex/tree, a bottom-up traversal will assign a maxlen of 1 for the leaf nodes a, b, d, e and c; then the maxlen for the bottom follows() node is 1 + 1 = 2; then the maxlen for the alt() node is max(1, 1, 2) = 2; the maxlen for the maybe node is 1; the maxlen for the topmost follows() node, and thus for the entire regex, is 2 + 1 = 3.
If you mean regexes that allow other Perl-style enhanced features, it might get much more complicated, because a locally optimal choice of length may lead to a globally suboptimal one. (I had thought that it might even be possible that Perl-style extensions make regexes Turing complete, meaning that it would be in general impossible to decide whether there is any matching string -- but the discussion here seems to indicate this is not the case, unless of course you allow in the ?{...}
construct.)