25

I always thought that a look-behind assertion in Java's regex-API (and many other languages for that matter) must have an obvious length. So, STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:

"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"

-- http://www.regular-expressions.info/lookaround.html

Using the curly brackets works as long as the total length of range of the characters inside the look-behind is smaller or equal to Integer.MAX_VALUE. So these regexes are valid:

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these aren't:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"

However, I don't understand the following:

When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).

But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).

Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).

Here's the test harness:

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

The output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

My question may be obvious, but I'll still ask it: Can anyone explain to me why Test 1 and 2 fail, and Test 3 and 4 don't? I would have expected them all to fail, not half of them to work and half of them to fail.

Thanks.

PS. I'm using: Java version 1.6.0_14

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288

2 Answers2

18

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behaviors for curlies and stars.

Jonathan Feinberg
  • 44,698
  • 7
  • 80
  • 103
  • It didn't occur to me to look at the source of Pattern... Silly me. Thank you! It makes perfect sense now. – Bart Kiers Oct 08 '09 at 12:03
  • One of the many reasons I can't live without Eclipse is my itchy ctrl-click finger (which, if you don't use Eclipse, means "open up the source file where this name is defined"). – Jonathan Feinberg Oct 08 '09 at 12:18
  • Thanks, I've never taken the trouble to attach the source to Eclipse. I will do so now for sure. Thanks. – Bart Kiers Oct 08 '09 at 12:21
13

It's a bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6695369

Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a lookbehind match.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • Hey Alan, figured you'd come along soon! Thanks, I knew this used to be a bug (in 1.5) but I remembered that the engine always evaluated to false or something. Now it does evaluate the result correctly, so I figured the bug got fixed in 1.6. I must've remembered incorrectly. Thanks for the info. Bart (aka prometheuzz) – Bart Kiers Oct 08 '09 at 17:21
  • That bug got fixed; this is a new one that was introduced in 1.6. :/ – Alan Moore Oct 09 '09 at 15:16
  • Link should now be [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6695369](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6695369) -- tried to edit it in but the edit queue is apparently full. – Gavin S. Yancey Feb 22 '22 at 23:58
  • @GavinS.Yancey, I fixed it. – Alan Moore Mar 20 '22 at 08:07
  • It's now https://bugs.java.com/bugdatabase/view_bug?bug_id=6695369 – mems Jun 15 '23 at 08:52