Executive summary: Are there any caveats or known issues with using \R (or other regex patterns) in Java's Scanner, especially regarding boundary conditions of its internal buffer?
Details: Since I wanted to do some multi-line pattern matching on potentially multi-platform input files, I used patterns with \R, which according to the Pattern javadoc is:
Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
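To make that definition concrete, here is a minimal sketch that applies the same kind of pattern with a plain Matcher on a String (no Scanner involved). The class name LinebreakDemo and the helpers firstMatch and escape are made up for illustration; it assumes Java 8 or later, where \R is supported. With the whole input available to the matcher, \r\n should be consumed as a single linebreak:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinebreakDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Makes CR and LF visible when printed.
    static String escape(String s) {
        return s == null ? "null" : s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Windows, Unix and old Mac line endings after the 'b'.
        for (String input : new String[] {"ab\r\nX", "ab\nX", "ab\rX"}) {
            System.out.println(escape(input) + " -> " + escape(firstMatch("b\\R", input)));
        }
    }
}
```

This is only a baseline for what I expected Scanner to return as well; it says nothing about how Scanner feeds its buffer to the matcher.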
Anyhow, I noticed in one of my test files that the loop that's supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line it stopped on coincided with the end of Scanner's internal buffer.
Here's a test program I wrote to simulate the situation:
import java.util.Scanner;

public class ScannerLinebreakTest {
    public static void main(String[] args) {
        testString(1);    // short prefix: the suffix is well inside the buffer
        testString(1022); // long prefix: the match ends at the end of Scanner's internal buffer
    }

    private static void testString(int prefixLen) {
        String suffix = "b\r\nX";
        // prefixLen copies of 'a' followed by the suffix
        String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;
        Scanner scanner = new Scanner(buffer);
        String pattern = "b\\R";
        System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
                buffer.length(), convertLineEndings(buffer), pattern,
                convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
        // If \R consumed the whole "\r\n", 'X' is the very next character.
        System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
        scanner.close();
    }

    // Makes CR and LF visible in the printed output.
    private static String convertLineEndings(String string) {
        return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
    }
}
... which produces this output (edited for formatting/brevity):
=================
Test String (Len=5): 'ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r\n
'X' found with horizon=1: true
=================
Test String (Len=1026): 'a ... ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r
'X' found with horizon=1: false
To me, this looks like a bug! I think the scanner should match that suffix with the patterns the same way regardless of where it shows up in the input text (as long as the prefix doesn't get involved with the patterns). (I have also found the possibly relevant OpenJDK bugs 8176407 and 8072582, but this was with the regular Oracle JDK 8u111.)
But I may have missed some recommendations regarding Scanner or \R usage in particular (or do OpenJDK and Oracle have identical(??) implementations of the relevant classes here?)... hence the question!