After looking at the Pattern
code and trying to trace it, I'm convinced that this is just a bug. Both examples should result in an exception. But the logic that tests for this is incorrect.
This code appears in a couple places:
temp = info.maxLength * cmax + maxL;
info.maxLength = temp;
if (temp < maxL) {
info.maxValid = false;
}
Note that as long as maxLength
and cmax
are nonnegative, temp
should never be less than maxL
unless overflow has occurred. maxValid
is eventually used by the lookbehind code; if maxValid
is false
, "look-behind group does not have obvious maximum length error"
is thrown.
From what I can tell, in a regex like <prefix> <expression>{m,n}
, in the above code, info.maxLength
is the maximum length of "expression", cmax
is the upper bound of the quantifier, and maxL
is the maximum length of "prefix". When a quantifier is *
or +
, the upper bound is set to Integer.MAX_VALUE
. (All the variables here are int
.) This means that there will be overflow unless info.maxLength
is 1 and maxL
is 0. That's exactly the case with
(?<=a*)bc
since the pattern with the quantifier has length 1, and there is nothing before the a*
which means maxL
will be 0. That's why this case falls through the cracks.
For any other values, the computation will overflow, but that doesn't necessarily mean temp < maxL
will be true. If info.maxLength
is even, then an exception will be thrown; but if info.maxLength
is odd, the pattern will compile if maxL
is small enough. This is because of the way wraparound works, mathematically; the code that attempts to check for overflow is quite faulty. This means that
(?<=a*)bc // succeeds
(?<=(ab)*)bc // throws exception
(?<=(abc)*)bc // succeeds
(?<=(abcd)*)bc // throws exception
(?<=(abcde)*)bc // succeeds
Also:
(?<=xa*)bc // throws exception
(?<=x(abc)*)bc // succeeds
Note: It should be noted that in your example regex, lookbehind is useless:
(?<=a*)bc
The lookbehind says to test whether the current location is preceded by zero or more occurrences of the letter a
. This is always true, trivially. So the lookbehind serves no purpose here. Similarly,
(?<=a+)bc
is equivalent to
(?<=a)bc
since as long as there's one a
preceding the current location, it doesn't matter how many more there might be. The same is true of all the examples that don't throw an exception, except for this one:
(?<=x(abc)*)bc // succeeds
since here the matcher does have to go backwards looking for abc
's in the string and making sure the last one is preceded by x
. It seems that Pattern
should throw an exception in this case, but because of the faulty overflow-check logic, it isn't. Therefore, I'm uncertain whether it would actually return the correct result, or whether it might crash or go into an infinite loop. However, I haven't tried it.
Really, the code should just check directly whether cmax
is equal to Integer.MAX_VALUE
, instead of using it in computation and hoping the code can tell later that the result is bogus.