116

Before Java 8, when we split on an empty string like this:

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |:

|a|b|c|

because the empty string "" exists before and after each character. As a result, it would first generate this array:

["", "a", "b", "c", ""]

and then remove the trailing empty strings (because we didn't explicitly pass a negative value as the limit argument), so it would finally return:

["", "a", "b", "c"]
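The trailing-empty-string removal described above does not depend on the Java version, so it can be checked with an ordinary delimiter. A minimal sketch (the class name `SplitLimitDemo` and its method names are made up for illustration):

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // limit == 0 (the default): trailing empty strings are dropped.
    static String[] defaultSplit(String s, String regex) {
        return s.split(regex);
    }

    // A negative limit keeps trailing empty strings.
    static String[] keepTrailing(String s, String regex) {
        return s.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultSplit("a,b,,", ",")));  // [a, b]
        System.out.println(Arrays.toString(keepTrailing("a,b,,", ",")));  // [a, b, , ]
    }
}
```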

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"].

My first guess was that leading empty strings are now also removed, just like trailing empty strings.

But this theory fails, since

"abc".split("a")

returns ["", "bc"], so the leading empty string was not removed.
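Both results can be reproduced side by side with a small program. This is only a sketch: the class name is arbitrary, and the first output comment assumes Java 8 or later.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    public static void main(String[] args) {
        // Zero-width match at index 0: no leading "" on Java 8 and later.
        System.out.println(Arrays.toString("abc".split("")));   // Java 8+: [a, b, c]

        // Positive-width match at index 0: the leading "" is kept.
        System.out.println(Arrays.toString("abc".split("a")));  // [, bc]
    }
}
```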

Can someone explain what is going on here? How have the rules of split changed in Java 8?

Pshemo
  • Java8 seems to fix that. Meanwhile, `s.split("(?!^)")` seems to work. – shkschneider Oct 21 '14 at 12:49
  • 3
    @shkschneider Behaviour described in my question is not a bug of pre Java-8 versions. This behaviour was not particularly very useful, but it still was correct (as shown in my question), so we can't say that it was "fixed". I see it more like improvement so we could use `split("")` instead of cryptic (for people who don't use regex) `split("(?!^)")` or `split("(?<!^)")` or few others regexes. – Pshemo Oct 21 '14 at 14:31
  • 1
    Encountered the same issue after upgrading Fedora to Fedora 21; Fedora 21 ships with JDK 1.8, and my IRC game application broke because of this. – LiuYan 刘研 Dec 16 '14 at 07:26
  • 8
    This question seems to be the only documentation of this breaking change in Java 8. Oracle left it out of their [list of incompatibilities](http://www.oracle.com/technetwork/java/javase/8-compatibility-guide-2156366.html). – Sean Van Gorder Jun 01 '15 at 19:22
  • 5
    This change in the JDK just cost me 2 hours of tracking down what is wrong. The code runs fine in my computer (JDK8) but fails mysteriously on another machine (JDK7). Oracle ***REALLY SHOULD*** update the documentation of ***String.split(String regex)***, rather than in Pattern.split or String.split(String regex, int limit) as this is by far the most common usage. Java is known for its portability aka so-called WORA. This is a major backward-breaking change and not well documented at all. – PoweredByRice Oct 04 '15 at 01:11
  • @Nhan Yes, I also had problems finding any information about this change, hence this question. Anyway, if you are looking for a way that will work in all versions, instead of `split("")` use `split("(?!^)")` - it will try to split on each empty string except the one at the start of the text. BTW, another change introduced in the Java 8 regex engine is `\R`, which represents `\n`, `\r` or `\r\n` (and a few other separators). – Pshemo Oct 04 '15 at 01:21
  • guess what you will get when using `"".split ("")`, tada, **"an empty leading substring is included at the beginning of the resulting array."** – LiuYan 刘研 Jan 02 '16 at 08:17
  • @LiuYan刘研 I suspect it is the same case as `"".split(",")` explained here: http://stackoverflow.com/a/25058091/1393766. In short, removing empty strings from the start or end of the result array makes sense ***only when their existence was the result of a split***. But in the case of `""` we know that we can't split it further, so just like `"a".split("b")` returns an array with the original string `["a"]`, for `"".split("whatever")` we get `[""]` (because no split needed to happen). – Pshemo Jan 02 '16 at 13:59
  • @Pshemo, indeed, as the code snippet in the answer you chose indicated: `if (index == 0) {return new String[] {input.toString()};}`. I wish JDK8 javadoc can add detail document about this. – LiuYan 刘研 Jan 04 '16 at 14:59

3 Answers

88

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we see that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause is also added to String.split in Java 8, compared to Java 7.

Reference implementation

Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The following code, added in Java 8, skips a zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
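To see which matches that condition can affect, it helps to list the match positions the loop iterates over. The sketch below uses only `Pattern` and `Matcher`; the class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositions {
    // Lists every match of regex in input as "start-end".
    static List<String> positions(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // The empty regex matches at 0-0 first: start == end == 0, so Java 8 skips it.
        System.out.println(positions("", "abc"));   // [0-0, 1-1, 2-2, 3-3]
        // "a" matches at 0-1: start != end, so the empty leading substring is kept.
        System.out.println(positions("a", "abc"));  // [0-1]
    }
}
```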

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions, matching the behavior in Java 8:

  1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string or not, do both actions from step 1.

(?!\A) checks that the match does not end at the beginning of the string. A match that ends there can only be an empty match at the beginning of the string, so exactly those matches are rejected.
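The steps above can be sketched as a small helper. `splitSkippingStart` is a hypothetical name, not a JDK method, and it always wraps the regex in `(?:...)`, which corresponds to step 3.

```java
import java.util.Arrays;

public class PortableSplit {
    // Hypothetical helper: wraps the regex in a non-capturing group and
    // appends (?!\A) so that no match may end at the start of the input.
    static String[] splitSkippingStart(String input, String regex) {
        return input.split("(?:" + regex + ")(?!\\A)");
    }

    public static void main(String[] args) {
        // Zero-width regexes: the match at index 0 is rejected by (?!\A).
        System.out.println(Arrays.toString(splitSkippingStart("abc", "")));                  // [a, b, c]
        System.out.println(Arrays.toString(splitSkippingStart("HelloWorld", "(?=[A-Z])")));  // [Hello, World]

        // Positive-width regex: the match ends at index 1, so nothing changes.
        System.out.println(Arrays.toString(splitSkippingStart("abc", "a")));                 // [, bc]
    }
}
```

Because the zero-width match at index 0 never happens, the version-specific handling of that match is never reached, which is why the result should not depend on the Java version.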

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.
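Such a custom implementation could look roughly like the sketch below, which mirrors the Java 7 loop quoted earlier in this answer. `LegacySplit` and `splitLikeJava7` are hypothetical names, and this is an illustration rather than a tested drop-in replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper that follows the Java 7 Pattern.split logic:
    // a zero-width match at index 0 still yields a leading empty string.
    static String[] splitLikeJava7(String input, String regex, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match (no special case for index 0).
        while (m.find()) {
            if (!matchLimited || parts.size() < limit - 1) {
                parts.add(input.substring(index, m.start()));
                index = m.end();
            } else if (parts.size() == limit - 1) { // last allowed segment
                parts.add(input.substring(index));
                index = m.end();
            }
        }

        // Same check as the quoted code: treated as "no match found".
        if (index == 0) {
            return new String[] {input};
        }

        // Add the remaining segment.
        if (!matchLimited || parts.size() < limit) {
            parts.add(input.substring(index));
        }

        // With limit == 0, drop trailing empty strings.
        int size = parts.size();
        if (limit == 0) {
            while (size > 0 && parts.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLikeJava7("abc", "", 0)));   // [, a, b, c]
        System.out.println(Arrays.toString(splitLikeJava7("abc", "a", 0)));  // [, bc]
    }
}
```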

nhahtdh
  • Any idea how I can change `split("")` code so that it is consistent across different Java versions? – Daniel Oct 29 '15 at 22:22
  • 2
    @Daniel: It's possible to make it forward-compatible (follow the behavior of Java 8) by adding `(?!^)` to **the end** of the regex and wrap the original regex in non-capturing group `(?:...)` (if necessary), but I can't think of any way to make it backward-compatible (follow the old behavior in Java 7 and prior). – nhahtdh Oct 30 '15 at 03:50
  • Thanks for the explanation. Could you describe `"(?!^)"`? In what scenarios it will be different from `""`? (I am terrible at regex! :-/). – Daniel Oct 31 '15 at 03:44
  • 1
    @Daniel: Its meaning is affected by `Pattern.MULTILINE` flag, while `\A` always matches at the beginning of the string regardless of flags. – nhahtdh Nov 02 '15 at 02:38
31

This has been specified in the documentation of split(String regex, int limit).

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

However, in your second snippet, where you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.

(Removed irrelevant source code)

Alexis C.
  • 3
    It's just a question. Is it okay to post a fragment of code from the JDK? Remember the copyright problem with Google - Harry Potter - Oracle? – Paul Vargas Mar 28 '14 at 17:24
  • 6
    @PaulVargas To be fair I don't know but I assume it's ok since you can download the JDK, and unzip the src file which contains all the sources. So technically everybody could see the source. – Alexis C. Mar 28 '14 at 17:28
  • 12
    @PaulVargas The "open" in "open source" does stand for something. – Marko Topolnik Mar 28 '14 at 17:49
  • 2
    @ZouZou: just because everybody can see it doesn't mean you can re-publish it – user102008 May 14 '14 at 20:20
  • @user102008 My point was that Oracle itself provides a way to download their sources legally. I'm not a lawyer but I assume that's ok. Anyway that's not the purpose of stackoverflow. – Alexis C. May 14 '14 at 20:24
  • 2
    @Paul Vargas, IANAL but in many other occasions this type of a post falls under quote / fair use situation. More on the topic is here: http://meta.stackexchange.com/questions/12527/do-i-have-to-worry-about-copyright-issues-for-code-posted-on-stack-overflow – Alex Pakka May 16 '14 at 05:16
  • 2
    This quotes the wrong part of the code. You are only quoting the fast path of Java 8 String.split, where a delimiter consisting of 1 non-regex character is processed separately from the rest. – nhahtdh Dec 15 '14 at 04:22
  • @nhahtdh You're right, to be honest I didn't go in depth to the code. +1 for your answer :-) – Alexis C. Dec 15 '14 at 09:02
14

There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

(emphasis mine)

The empty string split generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array in accordance with what is specified above. By contrast, your second example which splits on "a" generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.

arshajii
  • A few more seconds made the difference. – Paul Vargas Mar 28 '14 at 17:11
  • 2
    @PaulVargas actually here arshajii posted his answer a few seconds before ZouZou, but unfortunately ZouZou answered my question earlier [here](http://stackoverflow.com/questions/22718096/why-a-in-the-0th-index-of-an-array-on-perfoaming-a-split-w-o-delimiters#comment34621049_22718222). I was wondering if I should ask this question since I already knew the answer, but it seemed an interesting one and ZouZou deserved some reputation for his earlier comment. – Pshemo Mar 28 '14 at 17:13
  • 5
    Although the new behaviour looks more *logical*, it is obviously a **backward compatibility break**. The only justification for this change is that `"some-string".split("")` is quite a rare case. – ivstas Oct 29 '14 at 06:45
  • 4
    `.split("")` is not the only way to split without matching anything. We used a positive lookahead regex which in jdk7 also matched at the beginning and produced an empty head element, which is now gone. https://github.com/spray/spray/commit/5ab4fdf9ebd8986297e0137bc07088c6223276a0 – jrudolph Feb 10 '15 at 10:59