Executive summary: Are there any caveats/known issues with \R (or other regex pattern) usage in Java's Scanner (especially regarding internal buffer's boundary conditions)?

Details: Since I wanted to do some multi-line pattern matching on potentially multi-platform input files, I used patterns with \R, which, according to the Pattern javadoc, is:

Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
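
To make that definition concrete, here is a minimal sketch (the class and method names are made up for illustration) that splits a string containing Windows, Unix, and old Mac line endings with \R. It uses String.split rather than Scanner, so no internal buffer is involved:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the original test program.
final class LineBreakSplitDemo {

    // Splits text on any line break recognized by \R.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // CRLF, LF, and a lone CR in one string; CRLF counts as a single break.
        String mixed = "a\r\nb\nc\rd";
        System.out.println(splitLines(mixed)); // prints [a, b, c, d]
    }
}
```

The \r\n pair is consumed as one line break, so no empty element appears between "a" and "b".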

Anyhow, I noticed with one of my test files that the loop that's supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line it stopped on ended exactly at the end of Scanner's internal buffer.

Here's a test program I wrote to simulate the situation:

import java.util.Scanner;

public class ScannerLineBreakTest {

    public static void main(String[] args) {
        testString(1);    // suffix lies well inside Scanner's internal buffer
        testString(1022); // "b\r" ends exactly at the 1024-char buffer boundary
    }

    private static void testString(int prefixLen) {
        String suffix = "b\r\nX";
        String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;

        Scanner scanner = new Scanner(buffer);
        String pattern = "b\\R";
        System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
            buffer.length(), convertLineEndings(buffer), pattern,
            convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
        System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
        scanner.close();
    }

    // Makes CR and LF visible in the output by printing them as \r and \n.
    private static String convertLineEndings(String string) {
        return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
    }
}

... which produces this output (edited for formatting/brevity):

=================
Test String (Len=5): 'ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r\n
'X' found with horizon=1: true
=================
Test String (Len=1026): 'a ... ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r
'X' found with horizon=1: false

To me, this looks like a bug! I think the scanner should match the suffix against these patterns the same way regardless of where it appears in the input text (as long as the prefix doesn't interact with the patterns). (I also found the possibly relevant OpenJDK bugs 8176407 and 8072582, but I saw this with the regular Oracle JDK 8u111.)
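
One way to check that expectation on a given JDK is to compare Scanner against a plain Matcher run over the whole string, since the Matcher has no internal buffer boundary. The following is only a sketch: the class name, method names, and the 1000..1050 range are my own choices, not anything from the JDK or the bug reports.

```java
import java.util.Objects;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing Scanner and Matcher results.
final class BoundaryCheck {

    static final String PATTERN = "b\\R";

    // Builds prefixLen 'a' characters followed by the suffix from the question.
    static String buildInput(int prefixLen) {
        StringBuilder sb = new StringBuilder(prefixLen + 4);
        for (int i = 0; i < prefixLen; i++) {
            sb.append('a');
        }
        return sb.append("b\r\nX").toString();
    }

    // Baseline: the Matcher sees the whole input at once.
    static String viaMatcher(String input) {
        Matcher m = Pattern.compile(PATTERN).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Same search through Scanner, which reads its input in chunks.
    static String viaScanner(String input) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.findWithinHorizon(PATTERN, 0);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Sweep prefix lengths around the point where the suffix straddles the buffer end.
        for (int prefixLen = 1000; prefixLen <= 1050; prefixLen++) {
            String input = buildInput(prefixLen);
            String expected = viaMatcher(input);
            String actual = viaScanner(input);
            if (!Objects.equals(expected, actual)) {
                System.out.printf("mismatch at prefixLen=%d: matcher length=%d, scanner length=%d%n",
                    prefixLen, expected.length(), actual == null ? -1 : actual.length());
            }
        }
    }
}
```

If the run prints no mismatch lines, Scanner and Matcher agree for every tested prefix length on that JDK.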

But I may have missed some recommendations regarding Scanner or \R usage in particular (or perhaps OpenJDK and Oracle have identical(??) implementations of the relevant classes here?)... hence the question!

OzgurH
  • I don't want to sound unappreciative, but it wasn't helpful for me (I had already thought of enlarging the horizon, but didn't pick that as a "real solution" because it may not always be a viable option in the parsing logic). I do appreciate you taking the time to post an answer, but the main point of my question was that Scanner should not act differently based on the input length (or on where its internal buffer ends and what it covers). It may still help others. It's up to you... – OzgurH Jul 19 '18 at 11:34
  • It wouldn't be the first bug in Java's regex methods: https://stackoverflow.com/a/49264884/3600709 – ctwheels Oct 24 '19 at 15:12

2 Answers


I tested this code at Ideone, and it no longer returns "false" on recent versions of Java.

https://www.ideone.com/4wwYSj

If, however, I were stuck on an old version, or on one that still exhibits the bug, and I needed a general-purpose solution rather than a workaround for this one example, then I might try crafting a regex similar to \R that forces an extra one-character peek in the \r case. Note that the so-called "equivalent" pattern in the documentation is not truly equivalent, because it actually needs to be an atomic group. So you might end up with something like this:

(?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029](?=.|\Z))
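
Here is a rough sketch of how that pattern could be dropped into the test from the question. The class, constant, and method names are hypothetical, and I have only reasoned about the single-break case. One caveat: `.` does not match line terminators by default, so a single-character break directly followed by another line terminator (a blank line in LF-only input, for instance) would not match unless DOTALL is enabled.

```java
import java.util.Scanner;

// Hypothetical demo class wrapping the pattern shown above.
final class PeekingLineBreak {

    // The pattern from this answer, escaped for a Java string literal.
    static final String LINE_BREAK =
        "(?>\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029](?=.|\\Z))";

    // Returns { match for 'b' + line break, match for 'X' within horizon 1 }.
    static String[] probe(int prefixLen) {
        StringBuilder sb = new StringBuilder(prefixLen + 4);
        for (int i = 0; i < prefixLen; i++) {
            sb.append('a');
        }
        String input = sb.append("b\r\nX").toString();

        try (Scanner scanner = new Scanner(input)) {
            String lineEnd = scanner.findWithinHorizon("b" + LINE_BREAK, 0);
            String next = scanner.findWithinHorizon("X", 1);
            return new String[] { lineEnd, next };
        }
    }

    public static void main(String[] args) {
        for (int prefixLen : new int[] { 1, 1022 }) {
            String[] result = probe(prefixLen);
            System.out.printf("prefixLen=%d: matched %d chars, X found: %b%n",
                prefixLen, result[0].length(), result[1] != null);
        }
    }
}
```

The idea is that both alternatives try to look one character past the \r, so a match that ends at the buffer boundary should make the scanner fetch more input before committing to it.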

Patrick Parker
  • Thanks for the follow-up! I'm picking this as the "answer" as it confirms that the behavior seen was indeed a bug that was "fixed" in later versions. Thinking it would be nice to know which version has the fix, I followed in your footsteps and tried [the code on JDoodle](https://jdoodle.com/ia/eeW), which also allows for JDK version selection. There, it failed with **JDK 9.0.1**, but passed (returned "true") with **JDK 10.0.1** – OzgurH Jun 01 '21 at 14:15

Two suggestions:

I think you should test for X this way:

System.out.printf("'X' found with horizon=1: %b\n", 
    scanner.findWithinHorizon("X", prefixLen) != null);

(Since anything other than 0 as horizon parameter limits the search to a certain number of characters. That’s already in the name of the method. The horizon is as far as the method sees.)
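
A small sketch of the horizon semantics described above (the class and method names are invented for illustration): a horizon of 1 only looks at the next character, while 0 means no bound at all.

```java
import java.util.Scanner;

// Hypothetical demo class for the horizon parameter.
final class HorizonDemo {

    // Runs one findWithinHorizon call on a fresh Scanner over the input.
    static String find(String input, String regex, int horizon) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.findWithinHorizon(regex, horizon);
        }
    }

    public static void main(String[] args) {
        System.out.println(find("aX", "X", 1)); // prints null: only 'a' is within the horizon
        System.out.println(find("aX", "X", 2)); // prints X
        System.out.println(find("aX", "X", 0)); // prints X: horizon 0 means unbounded
    }
}
```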

Maybe there is a problem with your file encoding. Your scanner may pick the wrong default encoding. Try something along these lines:

new Scanner(file, "utf-8");
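
For completeness, a self-contained sketch of reading a file with an explicit charset. The temp file, class name, and method name are assumptions made just for this demo; the Scanner(Path, String) constructor is the standard one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

// Hypothetical demo class for opening a Scanner with an explicit charset.
final class ExplicitCharsetDemo {

    // Writes content to a temp file as UTF-8, then searches it with a UTF-8 Scanner.
    static String roundTrip(String content, String regex) {
        try {
            Path file = Files.createTempFile("scanner-demo", ".txt");
            try {
                Files.write(file, content.getBytes(StandardCharsets.UTF_8));
                try (Scanner scanner = new Scanner(file, StandardCharsets.UTF_8.name())) {
                    return scanner.findWithinHorizon(regex, 0);
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String match = roundTrip("ab\r\nX", "b\\R");
        System.out.println("matched " + match.length() + " chars"); // prints matched 3 chars
    }
}
```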
wp78de
  • Yes, enlarging the horizon (for 'X') was an available option in my business logic, and that was the workaround I've chosen as well, but that may not always be the case. If the previous code has "skipped" the new lines, the next find should be able to assume that's the case, and act accordingly. The encoding is not an issue here (it would have been if I only run into this with a file), but as you can see from the sample code above, it does happen with ordinary literal Java strings (with characters from standard ASCII). So what you see in the code should be what you (scanner) gets!... – OzgurH Mar 05 '18 at 09:50