The following code compiles in both Java 8 & 9, but behaves differently.

class Simple {
    static String sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    public static void main(String args[]){
        String[] chunks = sample.split("\\R\\R");
        for (String chunk: chunks) {
            System.out.println("Chunk : "+chunk);
        }
    }
}

When I run it with Java 8 it returns:

Chunk : 
En un lugar
de la Mancha
de cuyo nombre
no quiero acordarme

But when I run it with Java 9 the output is different:

Chunk : 
En un lugar
Chunk : de la Mancha
de cuyo nombre
Chunk : no quiero acordarme

Why?

Germán Bouzas
  • 4
    Looks like in Java 8 `\R` is greedy, while in 9 it is not. – doublep Dec 18 '17 at 16:02
  • What string do you get from `System.getProperty("line.separator")`? – Sergey Kalinichenko Dec 18 '17 at 16:06
  • 2
    @dasblinkenlight: That shouldn't matter; `\R` is [the linebreak matcher](https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html). It'll match whatever the OP has there. – Makoto Dec 18 '17 at 16:08
  • @Makoto Yet OP wants two `\R`s in a row. It looks like Java-8 treats `\r\n` as one line break marker, while Java-9 treats it as two line break markers. – Sergey Kalinichenko Dec 18 '17 at 16:09
  • I'd argue that something within the regex engine itself changed. [The documentation hasn't](https://docs.oracle.com/javase/9/docs/api/java/util/regex/Pattern.html) between versions, so there is definitely something going on with how the engine sees it (and the engine alone). I'm rifling through release notes to see what could've introduced this behavior but I'm not turning anything up. – Makoto Dec 18 '17 at 16:11
  • 2
    When posting this kind of question it's worth including the JDK version numbers because sometimes these are bugs fixed in point releases and then people cannot replicate etc. – Sled Dec 18 '17 at 17:37
  • 2
    @doublep I'm not sure you would call it greedy, but `\R` is not allowed to backtrack and break a single CR LF sequence in two: it is forbidden from matching just a CR if an LF follows. Java 8 was correct; Java 9 is now out of conformance with tr18 as far as I can discern. – tchrist Dec 19 '17 at 02:32
  • Germán, you have a typo in your Quijote! That *acordame* should read *acordarme*, [I swear](http://www.elmundo.es/quijote/capitulo.html?cual=1). :) – tchrist Dec 19 '17 at 02:35
  • yeah, noticed that... too late – Germán Bouzas Dec 19 '17 at 07:00

1 Answer

The Java documentation is out of conformance with the Unicode Standard. The Javadoc misstates what \R is supposed to match. It reads:

\R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

That Java documentation is buggy. In its section on R1.6 Line Breaks, Unicode Technical Standard #18 on Regular Expressions clearly states:

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

 (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

In other words, it can only match a two code-point CR+LF (carriage return + linefeed) sequence or else a single code-point from that set provided that it is not just a carriage return alone that is then followed by a linefeed. That’s because it is not allowed to back up. CRLF must be atomic for \R to function properly.
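
To make the difference concrete without depending on how any particular JDK implements \R, here is a minimal sketch that spells both expressions out by hand and splits the question's sample on two of them in a row. The class name `LineBreakDemo`, the constants, and `countChunks` are names I made up for illustration, and I have written the code points with `\x{...}` escapes rather than the notation used in the Javadoc and in UTS #18.

```java
import java.util.regex.Pattern;

public class LineBreakDemo {
    // Sample text from the question: CRLF after "lugar" and "nombre", bare LF elsewhere.
    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Alternation as printed in the Javadoc: nothing stops a retry that takes CR alone.
    static final String JAVADOC_STYLE =
        "(?:\\r\\n|[\\n\\x0B\\f\\r\\x85\\x{2028}\\x{2029}])";

    // UTS #18 form: the negative lookahead refuses a CR that is followed by LF.
    static final String UTS18_STYLE =
        "(?:\\r\\n|(?!\\r\\n)[\\n-\\r\\x85\\x{2028}\\x{2029}])";

    // Splits SAMPLE on two consecutive line breaks as defined by the given expression.
    static int countChunks(String lineBreak) {
        return Pattern.compile(lineBreak + lineBreak).split(SAMPLE).length;
    }

    public static void main(String[] args) {
        System.out.println("Javadoc-style: " + countChunks(JAVADOC_STYLE)); // prints 3
        System.out.println("UTS #18-style: " + countChunks(UTS18_STYLE));   // prints 1
    }
}
```

The Javadoc-style alternation finds a "double line break" inside each CRLF, because after the second copy fails the engine backs up and lets the first copy take the CR by itself. The UTS #18 form finds none in this sample, which is the single-chunk result the question reports for Java 8.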

So Java 9 no longer conforms to what R1.6 strongly recommends. Moreover, it is now doing something that it was supposed to NOT do, and did not do, in Java 8.

Looks like it's time for me to give Sherman (read: Xueming Shen) a holler again. I've worked with him before on these nitty-gritty matters of formal conformance.

tchrist
  • 2
    So a workaround would be to use either `(?>\\R)` or `\\R{1}+` instead of `\\R`, or in the OP’s specific case, use `\\R{2}+` instead of `\\R\\R`. Interestingly, even `\\R{1}\\R{1}` or `\\R{2}` give the desired result under Java 9, which is inconsistent, as the non-possessive `{n}` should not disable back-tracking. – Holger Dec 19 '17 at 12:25
  • Maybe this can get fixed with [JDK-8176983](https://bugs.openjdk.java.net/browse/JDK-8176983)? – Naman Dec 21 '17 at 01:16
  • @nullpointer can anyone tell me if this has been fixed in Java 10? It looks like the javadoc still has the wrong "equivalent" pattern, so at least the doc is wrong if not the implementation. – Patrick Parker Nov 12 '18 at 02:55
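
The atomic-group workaround from Holger's comment above could be tried with a sketch along these lines. It assumes a JDK where `\R` is available (Java 8 or later); the class name `AtomicLineBreakSplit` and the helper `chunks` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class AtomicLineBreakSplit {
    // Sample text from the question: it contains CRLF pairs but no doubled line break.
    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Each \R sits in an atomic group, so a CRLF it has consumed cannot be split on a retry.
    static final Pattern TWO_BREAKS = Pattern.compile("(?>\\R)(?>\\R)");

    // Splits the text wherever two complete line breaks follow one another.
    static String[] chunks(String text) {
        return TWO_BREAKS.split(text);
    }

    public static void main(String[] args) {
        for (String chunk : chunks(SAMPLE)) {
            System.out.println("Chunk : " + chunk);
        }
    }
}
```

With the atomic groups in place the sample should come back as a single chunk, while input that really has two line breaks in a row, such as `"a\r\n\r\nb"`, is still split in two.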