splitPreserveAllTokens method behavior

Question

        String string = "Hello,,l,, World";
        String test1[] = string.split(",,");

        String test2[] = StringUtils.splitPreserveAllTokens(string , ",,");

test2 has five elements

[Hello, , l, , World]

with two empty elements. test1 has three

[Hello, l, World]

which is the expected behavior.

According to the Javadoc of splitPreserveAllTokens, the following is logical:

     * StringUtils.splitPreserveAllTokens("::cd:ef", ":")    = ["", "", cd", "ef"]
     * StringUtils.splitPreserveAllTokens(":cd:ef:", ":")    = ["", cd", "ef", ""]

But the test2 output is still not clear to me. Please explain the additional empty elements in test2.

You could look at the source code, which might explain things: https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html - Tim Biegeleisen, Oct 11 '17 at 07:16
Yep, it explains it: it splits the string by each char of the separator. It still doesn't make sense to me though - Asiri Liyana Arachchi, Oct 11 '17 at 07:26

score 3 · Answer 1 · answered Oct 11 '17 at 07:17

In the docs it reads:

Adjacent separators are treated as separators for empty tokens.

and

separatorChars - the characters used as the delimiters, null splits on whitespace

meaning it should not make any difference if you use "," or ",," as second argument.

In combination with the first quote and the examples, I assume that the beginning and end of the string are treated as separators as well:

StringUtils.splitPreserveAllTokens(":cd:ef:", ":") yields one (empty) token between the beginning and the first colon, one token between the first and the second colon ("cd"), one between the second and third ("ef"), and one (again empty) between the last colon and the end of the string. That leads to the result shown in the docs: ["", "cd", "ef", ""] (with the typo corrected).

In your case the second quote above is the more relevant one. ",," is not treated as the separator but as a set of separator chars, meaning ",," is equivalent to "," in this case. Following the first quote, you can then explain the result you get:
Beginning of the string to first comma: "Hello"
first comma to second: ""
second comma to third: "l"
third to fourth: ""
fourth to end of the string: " World"
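
The same contrast can be reproduced with the JDK alone. This is a minimal sketch, assuming only String.split with a negative limit (which keeps empty tokens); the class and method names are made up for illustration:

```java
import java.util.Arrays;

final class SeparatorSetDemo {
    // Splits on every single comma and keeps empty tokens, which mirrors
    // treating ",," as the character set {','}.
    static String[] splitOnEachComma(String s) {
        return s.split(",", -1);
    }

    // Splits on the whole two-character separator ",," and keeps empty tokens.
    static String[] splitOnDoubleComma(String s) {
        return s.split(",,", -1);
    }

    public static void main(String[] args) {
        String string = "Hello,,l,, World";
        System.out.println(Arrays.toString(splitOnEachComma(string)));
        System.out.println(Arrays.toString(splitOnDoubleComma(string)));
    }
}
```

If the whole string ",," should act as a single separator within Commons Lang, StringUtils.splitByWholeSeparatorPreserveAllTokens is the method meant for that.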

Asiri Liyana Arachchi · Answer 2 · 2017-10-11T10:25:03.880

    String string = "Hello$l$ World";

    String test1[] = string.split("$$");

    String test2[] = StringUtils.splitPreserveAllTokens(string , "$$");

Output:

  Test2  [Hello, l,  World]
  Test1  [Hello$l$ World]
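
The Test1 result comes from String.split interpreting its argument as a regular expression, in which $ is an end-of-input anchor rather than a literal character. A small sketch, assuming Pattern.quote is used to make the dollar sign literal; the class and method names are illustrative only:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class DollarSplitDemo {
    // "$$" is a regex made of two end-of-input anchors, so no '$' character
    // in the text is ever matched as a delimiter.
    static String[] splitUnquoted(String s) {
        return s.split("$$");
    }

    // Pattern.quote makes "$" a literal, so the string is split at each '$'.
    static String[] splitQuoted(String s) {
        return s.split(Pattern.quote("$"));
    }

    public static void main(String[] args) {
        String string = "Hello$l$ World";
        System.out.println(Arrays.toString(splitUnquoted(string)));
        System.out.println(Arrays.toString(splitQuoted(string)));
    }
}
```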

The following is the relevant part of the code behind splitPreserveAllTokens:

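    // standard case: separatorChars is a set of individual delimiter characters
    while (i < len) {
        // the current character is one of the separator characters
        if (separatorChars.indexOf(str.charAt(i)) >= 0) {
            if (match || preserveAllTokens) {
                lastMatch = true;
                if (sizePlus1++ == max) {
                    i = len;
                    lastMatch = false;
                }
                // with preserveAllTokens, this may add an empty token
                list.add(str.substring(start, i));
                match = false;
            }
            start = ++i;
            continue;
        }
        lastMatch = false;
        match = true;
        i++;
    }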

This means that the separator chars are treated as a set of individual separator characters: whenever any one of them is found in the input string, the string is split at that position.
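
To make the character-set behavior concrete, here is a simplified re-implementation using only the JDK. It is a sketch and not the library code: it assumes a non-null separatorChars, ignores the max limit, and the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PreserveAllTokensSketch {
    // Splits str at every character contained in separatorChars and keeps
    // empty tokens. Assumption: separatorChars is non-null; no max limit.
    static String[] splitPreserveAll(String str, String separatorChars) {
        if (str == null) {
            return null;
        }
        if (str.isEmpty()) {
            return new String[0];
        }
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < str.length(); i++) {
            // Any single character from the set ends the current token.
            if (separatorChars.indexOf(str.charAt(i)) >= 0) {
                tokens.add(str.substring(start, i));
                start = i + 1;
            }
        }
        // The text after the last separator (possibly empty) is the final token.
        tokens.add(str.substring(start));
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPreserveAll("Hello,,l,, World", ",,")));
        System.out.println(Arrays.toString(splitPreserveAll("Hello$l$ World", "$$")));
    }
}
```

Passing ",," or "," gives the same tokens here, because duplicate characters add nothing to a set.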

An advantage of using this method over the usual split:

splitPreserveAllTokens handles a null input implicitly (it returns null instead of throwing).
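
For comparison, calling split on a null reference throws a NullPointerException, so plain-JDK code needs an explicit guard to get a similar effect. A hedged sketch; splitOrNull is a hypothetical helper, not part of Commons Lang or the JDK:

```java
final class NullSafeSplitDemo {
    // Hypothetical helper: returns null for a null input instead of throwing.
    // The -1 limit keeps empty tokens, including trailing ones.
    static String[] splitOrNull(String s, String regex) {
        return s == null ? null : s.split(regex, -1);
    }

    public static void main(String[] args) {
        String missing = null;
        System.out.println(splitOrNull(missing, ",") == null);
        try {
            missing.split(",");
        } catch (NullPointerException e) {
            System.out.println("plain split threw NullPointerException");
        }
    }
}
```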

And as mentioned in "Difference between splitByWholeSeparatorPreserveAllTokens and split":

StringUtils uses splitWorker(String str, char separatorChar, boolean preserveAllTokens), its own method, which is a performance tune for 2.0 (JDK1.4).
