52

I do not understand the output of this code:

public class StringDemo{              
    public static void main(String args[]) {
        String blank = "";                    
        String comma = ",";                   
        System.out.println("Output1: "+blank.split(",").length);  
        System.out.println("Output2: "+comma.split(",").length);  
    }
}

And got the following output:

Output1: 1 
Output2: 0
Simon MᶜKenzie
sanket patel
  • What do you not understand about it? – Raedwald Jul 31 '14 at 11:42
  • @Raedwald The confusing part was that `",".split(",")` could return a `["",""]` array, but it returns `[]` (an empty array of length 0, because `split(",",0)` removes trailing empty strings). So why was the empty string in the result array not removed in the case of `"".split(",")`? – Pshemo Jul 31 '14 at 11:52
  • The weirdness of `String.split` is exactly why the Guava library has `Splitter`, as [explained in the Guava documentation](https://code.google.com/p/guava-libraries/wiki/StringsExplained#Splitter) – Daniel Pryden Jul 31 '14 at 20:24

8 Answers

55

Documentation:

For: System.out.println("Output1: "+blank.split(",").length);

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

It simply returns the entire string as the only element of the array, which is why the length is 1.


For the second case, splitting "," on "," produces two empty strings, and String.split discards trailing empty strings, so the result is an empty array.

String.split silently discards trailing separators

See Guava's StringsExplained too.
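The two rules quoted above can be checked with a few lines. This is only a minimal sketch; the class name SplitBasics and the helper parts are made up for illustration and are not part of any library.

```java
import java.util.Arrays;

class SplitBasics {
    // Hypothetical helper: the array produced by the one-argument split on ",".
    static String[] parts(String input) {
        return input.split(",");
    }

    public static void main(String[] args) {
        // No match: the array holds the original (empty) string, so the length is 1.
        System.out.println("blank: length=" + parts("").length);
        // One match: two empty strings are produced, then dropped as trailing empties.
        System.out.println("comma: length=" + parts(",").length);
        // Trailing empty strings vanish, leading ones do not.
        System.out.println(Arrays.toString(parts("a,b,,")));
        System.out.println(Arrays.toString(parts(",a,b")));
    }
}
```

Note that only trailing empty strings are dropped; the leading empty string in the last call stays in the array.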

Marco Acierno
  • The Javadoc of the one-argument split method says: "This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. **Trailing empty strings are therefore not included in the resulting array.**" That's the correct explanation of the second result. Two trailing empty strings get excluded. – COME FROM Jul 31 '14 at 12:07
  • Yeah, in theory everything is in doc. But I always wonder from where they are getting those guys that you can read 10 times what they've written, and yet still you have to write a test program to understand what that method is actually doing... – Danubian Sailor Jul 31 '14 at 12:16
34

Everything happens according to plan, but let's do it step by step (I hope you have some time).

According to documentation (and source code) of split(String regex) method:

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.

So when you invoke

split(String regex)

you actually get the result of the split(String regex, int limit) method, invoked as:

split(regex, 0)

So here limit is set to 0.

You need to know a few things about this parameter:

  • If limit is positive, the result array has at most that many elements, so "axaxaxaxa".split("x",2) will return the array ["a", "axaxaxa"], not ["a","a","a","a","a"].
  • If limit is 0 then you are not limiting the length of the result array. But it also means that any trailing empty strings will be removed. For example:

    "fooXbarX".split("X")
    

    will at first generate an array that looks like:

    ["foo", "bar", ""]

    ("barX" split on "X" generates "bar" and ""), but since split removes all trailing empty strings, it will return

    ["foo", "bar"]
    
  • A negative limit behaves like a limit of 0 in that it does not limit the length of the result array. The only difference is that it does not remove empty strings from the end of the result array. In other words

    "fooXbarX".split("X",-1)
    

will return ["foo", "bar", ""]
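The three kinds of limit described in the list above can be compared side by side. This is a small sketch; SplitLimitDemo and show are illustrative names, not JDK API.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: renders the result of split(regex, limit) as text.
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        // Positive limit: at most 2 elements, the rest stays unsplit.
        System.out.println(show("axaxaxaxa", "x", 2));  // [a, axaxaxa]
        // Zero limit: unlimited length, trailing empty strings removed.
        System.out.println(show("fooXbarX", "X", 0));   // [foo, bar]
        // Negative limit: unlimited length, trailing empty strings kept.
        System.out.println(show("fooXbarX", "X", -1));  // [foo, bar, ]
    }
}
```

The last line prints "[foo, bar, ]" because Arrays.toString renders the kept empty string as nothing after the final comma.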


Let's take a look at the case

",".split(",").length

which (as explained earlier) is the same as

",".split(",", 0).length

This means that we are using the version of split which does not limit the length of the result array, but removes all trailing empty strings (""). You need to understand that when we split one thing at one place, we always get two things.

In other words, if we split "abc" at b, we get "a" and "c".
The tricky part is to understand that if we split "abc" at c, we get "ab" and "" (an empty string).

Using this logic, if we split "," on , we will get "" and "" (two empty strings).

You can check it using split with a negative limit:

for (String s: ",".split(",", -1)){
    System.out.println("\""+s+"\"");
}

which will print

""
""

So, as we can see, the result array is at first ["", ""].

But since by default the limit is 0, all trailing empty strings are removed. In this case the result array contains only trailing empty strings, so all of them are removed, leaving you with an empty array [], which has length 0.


To answer the case with

"".split(",").length

you need to understand that removing trailing empty strings makes sense only if such trailing empty strings were the result of splitting (and most probably are not needed).
So if there were no places at which we could split, there is no chance that empty strings were created, so there is no point in running this "cleaning" process.

This is mentioned in the documentation of the split(String regex, int limit) method, where you can read:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

You can also see this behaviour in the source code of this method (from Java 8):

    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
           (1)one-char String and this character is not one of the
              RegEx's meta characters ".$|()[{^?*+\\", or
           (2)two-char String and the first char is the backslash and
              the second is not the ascii digit or ascii letter.
        */
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

where you can find

if (off == 0)
    return new String[]{this};

fragment which means

  • if (off == 0) - if off (position from which method should start searching for next possible match for regex passed as split argument) is still 0 after iterating over entire string, we didn't find any match, so the string was not split
  • return new String[]{this}; - in that case let's just return an array with original string (represented by this).

Since "," couldn't be found in "" even once, "".split(",") must return an array with one element (empty string on which you invoked split). This means that the length of this array is 1.
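To see both results come out of the same logic, here is a stripped-down sketch of that fast path for a plain one-character delimiter and a limit of 0. It is not the JDK code: SplitSketch and splitOnChar are hypothetical names, and the limit handling and regex checks are left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Hypothetical helper, not a JDK method: mimics split(regex, 0)
    // for a single literal delimiter character.
    static String[] splitOnChar(String s, char delimiter) {
        List<String> list = new ArrayList<>();
        int off = 0;   // position from which the next search starts
        int next;
        while ((next = s.indexOf(delimiter, off)) != -1) {
            list.add(s.substring(off, next));
            off = next + 1;
        }
        // No match at all: return the original string, no cleaning step.
        if (off == 0) {
            return new String[] { s };
        }
        // Add the remaining segment after the last delimiter.
        list.add(s.substring(off));
        // Limit 0 behaviour: drop trailing empty strings.
        int resultSize = list.size();
        while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
            resultSize--;
        }
        return list.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(splitOnChar("", ',').length);   // early return, one element
        System.out.println(splitOnChar(",", ',').length);  // two empties, both trimmed
    }
}
```

For "" the method returns before the trimming loop is ever reached, which is why the single empty string survives; for "," the loop runs and removes both elements.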

BTW, Java 8 introduced another mechanism: it removes leading empty strings (if they were created during the splitting process) when we split using a zero-length regex (like "" or a look-around such as (?<!x)). More info at: Why in Java 8 split sometimes removes empty strings at start of result array?
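A quick way to observe the zero-length case is sketched below, assuming Java 8 or later (on Java 7 the first result would start with an extra empty string). ZeroWidthSplitDemo and show are illustrative names only.

```java
import java.util.Arrays;

class ZeroWidthSplitDemo {
    // Hypothetical helper: renders the result of the one-argument split as text.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // Empty regex matches between every character; on Java 8+ the
        // zero-width match at index 0 does not create a leading "".
        System.out.println(show("abc", ""));       // [a, b, c]
        // A look-ahead is also zero-width; it matches only before 'b'.
        System.out.println(show("abc", "(?=b)"));  // [a, bc]
        // A zero-width match at the very start yields no split at all.
        System.out.println(show("abc", "(?=a)"));  // [abc]
    }
}
```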

Pshemo
  • Sorry about the off-topic comment, but may I ask how you generated that code block with line numbers and formatting? – Bob Aug 01 '14 at 15:35
  • @Bob When you hover your mouse over a line number at grepcode you will see `<>`. When you click it, a box opens in which you can specify the range of lines you want to get as HTML code. – Pshemo Aug 01 '14 at 15:39
  • Ah, a bit unfortunate that it's grepcode-specific, but still pretty nice. Thanks. – Bob Aug 01 '14 at 15:43
  • @Pshemo **1)** Then why `"".split("").length` is `1`. How is it different from `",".split(",").length` ? **2)** `" ".split("").length` is `1` as regex `""` cannot be found in `" "`. Seems OK. **3)** But how come `" ".split("").length` is `2`? **4)** Also how come both `"a ".split("").length` and `"a a".split("").length` is `3`? In this case, why `""` is matching in between and not the start. If it also matches at start length would be `4`. **5)** What I want to know is, when should I match regex `""` with the given string and when shouldn't I match? – AnV Jan 03 '17 at 14:32
  • @Pshemo **5)** Also `" ".split(" ").length` is `0`. regex `" "` was found in string `" "` , all trailing were removed and hence. Seems OK. **6)** But how come `" ".split(" ").length` is `0`? **a)** why regex `" "` (1 space) cannot be found in `" "`(2 spaces). **b)** Even if there is no match, shouldn't the length be `1`? because, entire string `" "` should have got stored as one element in that array. **7)** And `"a ".split(" ").length` is `1` but array has `[a]`. `" a".split(" ").length` is `3` but array has `[, , a]`. Very confusing. – AnV Jan 03 '17 at 15:16
  • @Pshemo **8)** When should I match regex `" "` with the given string and when shouldn't I match? – AnV Jan 03 '17 at 15:16
  • @AbhinavVutukuri To answer your questions I would need more than one comment. Could you post these examples as a separate question (or questions)? Also, it may be important to point out which version of Java you are using. Judging by your profile picture it may be Android, which can be using Java 7 instead of Java 8, where you can get slightly different results. – Pshemo Jan 03 '17 at 15:40
  • @AbhinavVutukuri Anyway, in short, you can think that Java assumes you can't split `""` further, so for each `"".split(whatever)` you will always get a `[""]` array. In the case of `",".split(",")` the regex matches the entire string, so at first you get a `["", ""]` array, from which trailing empty strings are then removed, leaving an empty array, so its length is `0`, not `2` (I don't know where you got that value from). `" ".split("")` in Java 8 gives me `[" "]`. Originally it was `["", " ", ""]`: an empty string exists at the start and at the end of the string. The trailing empty string was removed, and the leading one is removed in Java 8. – Pshemo Jan 03 '17 at 15:51
  • @Pshemo I have posted a new question with full details. link: http://stackoverflow.com/q/41449791/2818583. I am using **Java 8 u112**. All code was compiled and run using IntelliJ IDEA: https://www.jetbrains.com/idea/. In the case of `",".split(",")` I also got length as '0'. **2)** in my comment signifies **question 2**, not the length of the array. I was asking how `"".split("")` is different from `",".split(",")`. You explained that clearly in your comment. Thanks. – AnV Jan 03 '17 at 18:20
7

From the Java 1.7 Documentation

Splits the string around matches of the given regular expression.

split() method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

In case 1, blank.split(",") does not match any part of the input, so the resulting array has just one element, namely this string.

It returns the entire string, so the length is 1.

In case 2, comma.split(",") returns an empty array.

split() expects a regex as its argument and returns the substrings around matches of that regex. Here both substrings are empty and trailing, so they are dropped.

So the length is 0.

For example (from the documentation):

The string "boo:and:foo", yields the following results with these expressions:

Regex     Result
  :     { "boo", "and", "foo" }
  o     { "b", "", ":and:f" }

Parameters: regex - the delimiting regular expression

Returns: the array of strings computed by splitting this string around matches of the given regular expression

Throws: PatternSyntaxException - if the regular expression's syntax is invalid
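The documentation example above is easy to reproduce. In this sketch the class name BooAndFooDemo and the helper show are made up; the third call, with a negative limit, is added only to make the trailing empty strings visible before they are discarded.

```java
import java.util.Arrays;

class BooAndFooDemo {
    // Hypothetical helper: renders split(regex, limit) on the documentation string.
    static String show(String regex, int limit) {
        return Arrays.toString("boo:and:foo".split(regex, limit));
    }

    public static void main(String[] args) {
        System.out.println(show(":", 0));   // [boo, and, foo]
        // The two trailing "o" characters produce two trailing empty strings,
        // which limit 0 removes.
        System.out.println(show("o", 0));   // [b, , :and:f]
        // A negative limit keeps them.
        System.out.println(show("o", -1));  // [b, , :and:f, , ]
    }
}
```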

Naveen Kumar Alone
4

From String class javadoc for the public String[] split(String regex) method:

Splits this string around matches of the given regular expression.

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

In the first case, the expression does not match any part of the input so we got an array with only one element - the input.

In the second case, the expression matches input and split should return two empty strings; but, according to javadoc, they are discarded (because they are trailing and empty).

Ivan Nikolaev
3

We can take a look into the source code of java.util.regex.Pattern which is behind String.split. Way down the rabbit hole the method

public String[] split(CharSequence input, int limit)

is invoked.

Input ""

For input "" this method is called as

String[] parts = split("", 0);

The interesting part of this method is:

  int index = 0;
  boolean matchLimited = limit > 0;
  ArrayList<String> matchList = new ArrayList<>();
  Matcher m = matcher(input);

  while(m.find()) {
    // Tichodroma: this will not happen for our input
  }

  // If no match was found, return this
  if (index == 0)
    return new String[] {input.toString()};

And that is what happens: new String[] {input.toString()} is returned.

Input ","

For input "," the interesting part is

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);

Here resultSize == 0 and limit == 0 so new String[0] is returned.
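The same behaviour can be observed by calling Pattern.split directly, without going through String.split. This is a minimal sketch; PatternSplitDemo and count are illustrative names.

```java
import java.util.regex.Pattern;

class PatternSplitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: number of elements Pattern.split returns for a limit.
    static int count(String input, int limit) {
        return COMMA.split(input, limit).length;
    }

    public static void main(String[] args) {
        // No match in "": the input itself is returned as the only element.
        System.out.println(count("", 0));    // 1
        // One match in ",": two empty strings, both trimmed because limit == 0.
        System.out.println(count(",", 0));   // 0
        // With a negative limit the trimming loop is skipped.
        System.out.println(count(",", -1));  // 2
    }
}
```

The negative-limit call shows that matchList really holds two empty strings before the limit == 0 loop shrinks resultSize.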

  • I believe that your last sentence is an oversimplification, so much so that it cripples the value of your answer. The ***interesting*** (i.e., *relevant*) part is lines 1223-1225. Entering line 1223, `resultSize` is `2`, because `matchList` is { `""`, `""` }. But, *because* `limit` is `0` (the default when `split` is called with only one parameter), the loop at lines 1224-1225 gets invoked, and it iterates twice, discarding the two empty strings and decrementing `resultSize` to `0`. – Scott - Слава Україні Jul 31 '14 at 18:23
2

From JDK 1.7

 public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
           (1)one-char String and this character is not one of the
              RegEx's meta characters ".$|()[{^?*+\\", or
           (2)two-char String and the first char is the backslash and
              the second is not the ascii digit or ascii letter.
        */
        char ch = 0;
        if (((regex.count == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, count));
                    off = count;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[] { this };

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, count));

            // Construct result
            int resultSize = list.size();
            if (limit == 0)
                while (resultSize > 0 && list.get(resultSize-1).length() == 0)
                    resultSize--;
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

So in this case the regex "," is handled by the first if (the fast path).

For the first case blank.split(",")

// If no match was found, return this
if (off == 0)
   return new String[] { this };

So this function returns an array containing one element (the original string) if there is no match.

For the second case comma.split(",")

List<String> list = new ArrayList<>();
//...
int resultSize = list.size();
if (limit == 0)
    while (resultSize > 0 && list.get(resultSize-1).length() == 0)
           resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);

As you can see, the final while loop removes all empty elements at the end of the list, so resultSize is 0.

EpicPandaForce
Pham Trung
1
String blank = "";                    
String comma = ",";                   
System.out.println("Output1: "+blank.split(",").length);  // case 1
System.out.println("Output2: "+comma.split(",").length);  // case 2

case 1 - Here blank.split(",") returns an array containing "" because there is no , in blank, so you get the same string back. So the length will be 1.

case 2 - Here comma.split(",") returns an empty array: the split yields two empty strings, and both are dropped as trailing empty strings, so the length is 0. To keep them (length 2), pass a negative limit, e.g. comma.split(",", -1).

Again, split() expects a regex as its argument and returns the array of substrings around matches of that regex.

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string.

Else

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

Ruchira Gayan Ranaweera
1

The API for the split method states that "If the expression does not match any part of the input then the resulting array has just one element, namely this string."

So, as the String blank doesn't contain a ",", a String[] with one element (i.e. blank itself) is returned.

For the String comma, "nothing" is left of the original string, so an empty array is returned.

This seems to be the best solution if you want to process the returned result, e.g.

String[] splits = aString.split(",");
for(String split: splits) {
   // do something
}
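As a rough illustration of processing the result in a loop, the sketch below counts non-empty tokens. SplitLoopDemo and countNonEmpty are hypothetical names; the isEmpty check is an assumption about what "do something" might need, since for a blank input the loop body still runs once with "".

```java
class SplitLoopDemo {
    // Hypothetical helper: counts the non-empty tokens of a comma-separated string.
    static int countNonEmpty(String aString) {
        int count = 0;
        for (String split : aString.split(",")) {
            // For "" the array is [""], so this guard skips the single empty token.
            if (!split.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonEmpty(""));      // 0, loop body ran once
        System.out.println(countNonEmpty(","));     // 0, loop body never ran
        System.out.println(countNonEmpty("a,,b"));  // 2, the middle "" is skipped
    }
}
```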
Ralf Wagner