2

Having a Regex pattern regexPattern, how can I determine the length of the longest string that matches the regexPattern.

The imaginary int LongestLength(string pattern) should work like this:

Assert.Equals(LongestLength("[abc]"), 1);
Assert.Equals(LongestLength("(a|b)c?"), 2);

Assert.Equals(LongestLength("d*"), int.MaxValue); // Or throws NoLongestLengthException

Although the question is in C#, both C# and JavaScript answers are good.

mehrandvd
  • 8,806
  • 12
  • 64
  • 111
  • 2
    Beyond the simplest regular expressions, it is difficult to even figure out what strings can even match, much less the longest possible string. What you're asking for is either magic, or brute force. Neither of which is going to solve the problem. – willaien Jul 01 '15 at 21:47
  • *"how can I guess"* - What do you mean "guess"? – nnnnnn Jul 01 '15 at 21:53
  • @nnnnnn I mean how can determine or find out. Sorry I've just corrected the question. – mehrandvd Jul 01 '15 at 21:58
  • Convert the regex to a DFA? This certainly doesn't seem to be trivial. – Sean Bright Jul 01 '15 at 22:01
  • Do you want to handle any C# (or javascript) regex? If you limit the possibilities a bit, it is much easier. (Back references will be particularly annoying). – rici Jul 02 '15 at 00:12
  • @willaien: Please see my answer. – j_random_hacker Jul 02 '15 at 00:30
  • @rici As a matter of fact, this question is a prerequisite to this question: http://bit.ly/1RUoO4x Please check it if it simplifies the question. – mehrandvd Jul 02 '15 at 20:05
  • @mehrandvd: That other question (like this one) allows any Regexp, but it is often the case that restricting the possible regexp operators simplifies the problem. For example, if you can reduce the regexp to a DFA (which you can always do with mathematical regular expressions) then you don't need to do that kind of analysis; you just maintain the current state at the end of the chunk so that you can continue the match in the next chunk. Unfortunately, I know of no mainstream regex library which allows you to do "partial_match_returning_context", although it shouldn't be difficult to add. – rici Jul 02 '15 at 20:49

2 Answers2

4

It's pretty straightforward for a proper regex using just the operators ?, * and + and |, plus parentheses, character classes and of course ordinary characters. In fact even \1-style backreferences (which aren't part of the formal definition of a regex, and do complicate some questions about regexes) can be handled without problems.

A regex is just a compact encoding of a tree structure (similar to how mathematical formulas are compact encodings of tree structures describing arithmetic). Between every adjacent pair of characters there is an implicit "follows" operator that corresponds to a node with 2 children, one being the subregex just to its left, and the other being the entire rest of the regex; a sequence of subregexes separated by | characters corresponds to a single "alt" node with as many children as there are alternatives (i.e., one more than the number of | characters), while every other operator has just a single child (namely the subregex it operates on), and every ordinary character or character class has no children at all. To calculate the maximum-length matching string, you can just do a bottom-up traversal of this tree structure, at each node greedily assigning the length of the longest string that would match that node, given knowledge of the longest strings that would match its children.

The rules for deciding the length of the longest string that matches any given node in this tree are:

  • follows(x, y) (xy): maxlen(x) + maxlen(y)
  • alt(a, b, ..., z) (a|b|...|z): max(maxlen(a), maxlen(b), ..., maxlen(z))
  • maybe(x) (x?): maxlen(x)
  • rep(x) (x*) or posrep(x) (x+): infinity
  • Any other single character or character class ([...]): 1
  • \1-style backreferences: the maxlen for the corresponding parenthesised expression

One consequence is that the presence of * or + anywhere (unless escaped or part of a character class, obviously) will cause infinity to propagate up the tree until it hits the root.

Examples

Regex: abcd
"Function call syntax": follows(a, follows(b, follows(c, d)))
As a tree:

follows
 /  \
a  follows
    /  \
   b  follows
       /  \
      c    d

A second example:

Regex: (a|b|de)c?
"Function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c))
As a tree:

     follows
     /     \
   alt     maybe
  / | \        \
 a  b follows   c
       /  \
      d    e

For this second regex/tree, a bottom-up traversal will assign a maxlen of 1 for the leaf nodes a, b, d, e and c; then the maxlen for the bottom follows() node is 1 + 1 = 2; then the maxlen for the alt() node is max(1, 1, 2) = 2; the maxlen for the maybe node is 1; the maxlen for the topmost follows() node, and thus for the entire regex, is 2 + 1 = 3.

If you mean regexes that allow other Perl-style enhanced features, it might get much more complicated, because a locally optimal choice of length may lead to a globally suboptimal one. (I had thought that it might even be possible that Perl-style extensions make regexes Turing complete, meaning that it would be in general impossible to decide whether there is any matching string -- but the discussion here seems to indicate this is not the case, unless of course you allow in the ?{...} construct.)

Community
  • 1
  • 1
j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • This is the same as the **initial analysis** that regex engines may do to optimize the regex to quickly fail when the length guarantees no match can be found. Java uses similar analysis to check the length of the look-behind pattern to reject inf. length pattern (though there is a long standing bug about it). Anyway, beyond the syntax of theoretical regex, there is no reliable way to tell the length of the longest matching string. – nhahtdh Jul 02 '15 at 03:42
  • @nhahtdh: Nothing can be ruled out by determining the *longest* possible match as I do here. You could sometimes rule out a match by doing a similar calculation to compute the *shortest* possible match though. – j_random_hacker Jul 02 '15 at 09:47
  • Ah, yes, you are right. The longest possible match is only used for analyzing the look-behind pattern. The shortest possible match is used for initial length check. – nhahtdh Jul 02 '15 at 10:08
0

So how I would do this function is by first creating key value pair datatype. Then filling up the data type with every type of regex syntax. so the key would be a regex syntax (For example: "*"). The value would be how much it would add to the possible length of strings that match. So the key: "*" would have a value of int.maxvalue. So you would loop through your list and search through the regex expression that was passed in for any of the syntax and sum up all the values you find and return it. However you have to keep in mind some syntax are escaped so you can't count them. As well as that some of the syntax automatically make the possible length to int.maxvalue ("*", "+", etc..). So check these syntax first so you can automatically send back int.maxvalue as soon as you find one these types of regex syntax.

A.sharif
  • 1,977
  • 1
  • 16
  • 27