15

I was working through a question from the book Oracle Certified Professional Java SE 7 Programmer Exams 1Z0-804 and 1Z0-805 by Ganesh and Sharma.

One question is:

  1. Consider the following program and predict the output:

      class Test {
    
        public static void main(String args[]) {
          String test = "I am preparing for OCPJP";
          String[] tokens = test.split("\\S");
          System.out.println(tokens.length);
        }
      }
    

    a) 0

    b) 5

    c) 12

    d) 16

Now I understand that \S is a regex that matches any non-space character, so here non-space chars are treated as the delimiters. But I was puzzled as to how the regex does its matching and what tokens split actually produces.

I added code to print out the tokens, as follows:

    for (String str : tokens) {
      System.out.println("<" + str + ">");
    }

and I got the following output:

16

<>

< >

<>

< >

<>

<>

<>

<>

<>

<>

<>

<>

< >

<>

<>

< >

So there are a lot of empty string tokens. I just do not understand this.

I would have thought that if the delimiters are non-space chars, then every alphabetic char in the above text serves as a delimiter, so there should be 21 tokens if tokens that are empty strings count too. I just don't understand how Java's regex engine works this out. Are there any regex gurus out there who can shed light on this code for me?

Duncan Jones
Frank Brosnan
  • 1
    I tried your example and it makes much more sense if you replace \\S with \\s. Could this be a typo? – mreiterer Oct 09 '14 at 14:32
  • 2
    @mreiterer This is for a certification exam, why would it seem strange that they throw in a tricky case like this? The fact that they included the correct answer (16) as one of the choices makes it very unlikely that this was unintentional. – ajb Oct 09 '14 at 14:49
  • P.S. If 21 had been one of the choices, I probably would have gotten this wrong. – ajb Oct 09 '14 at 14:51
  • Hi, no, it was meant to be \\S, the opposite of \\s. Tricky one, this. – Frank Brosnan Oct 09 '14 at 14:54

3 Answers

12

Copied from the API documentation (emphasis mine):

public String[] split(String regex)

Splits this string around matches of the given regular expression. This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

The string "boo:and:foo", for example, yields the following results with these expressions:

 Regex  Result
   :    { "boo", "and", "foo" }
   o    { "b", "", ":and:f" }

Look at the second example, where the last two "o" characters produce trailing empty strings that are simply dropped. The same thing answers your question: the "OCPJP" substring is treated as a run of separators that is not followed by any non-empty string, so the empty strings it produces are trimmed.
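The Javadoc table can be reproduced with a few lines. This is only a small sketch; the class name `JavadocSplitExample` and the helper `show` are made up for illustration, and `Arrays.toString` is used so that empty strings stay visible.

```java
import java.util.Arrays;

class JavadocSplitExample {

    // Hypothetical helper: formats the split result so empty strings are visible.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // ":" never matches at the end of the input, so nothing is trimmed.
        System.out.println(show("boo:and:foo", ":")); // [boo, and, foo]

        // The final "oo" would produce two trailing empty strings,
        // which the one-argument split discards.
        System.out.println(show("boo:and:foo", "o")); // [b, , :and:f]
    }
}
```

Note that the empty string between the first two "o" characters survives, because it is not at the end of the array.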

Pablo Lozano
  • Thanks Pablo that makes sense if you ignore the empty strings after the last space. That would explain the number. 16 instead of 21 ish. – Frank Brosnan Oct 09 '14 at 14:51
  • This is a slightly different point, but say you had a comma-separated file where the values at the end of a line are empty, for example an export from an Excel spreadsheet where the user did not enter a value. Would this mean that String.split would throw them away? That might lead to nasty bugs if you were expecting to process the data. Just thinking aloud :-). – Frank Brosnan Oct 09 '14 at 14:57
  • Yes, that's the reason you have to check the length of the array when splitting a CSV line. If you mix that and the fact that CSV format lacks any standard... – Pablo Lozano Oct 09 '14 at 15:04
  • 1
    @FrankBrosnan In that case you may want to consider `split(",", -1)`. – ntoskrnl Oct 09 '14 at 15:43
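To illustrate the CSV concern raised in these comments, here is a minimal sketch comparing the default split with a negative limit. The row `"alice,42,,"` and the names `CsvTrailingFields` and `fields` are invented for the example, and a plain split like this does not handle quoted fields, so it is not a substitute for a real CSV parser.

```java
import java.util.Arrays;

class CsvTrailingFields {

    // Hypothetical helper: a negative limit keeps trailing empty fields.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "alice,42,,"; // made-up row with two empty trailing fields

        // Default limit of zero: the two empty fields at the end are dropped.
        System.out.println(line.split(",").length); // 2

        // Negative limit: all four fields are kept.
        System.out.println(fields(line).length);           // 4
        System.out.println(Arrays.toString(fields(line))); // [alice, 42, , ]
    }
}
```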
8

The reason the result is 16 and not 21 is this sentence from the Javadoc for String.split:

Trailing empty strings are therefore not included in the resulting array.

This means, for example, that if you say

"/abc//def/ghi///".split("/")

the result will have five elements. The first will be "", since it's not a trailing empty string; the others will be "abc", "", "def", and "ghi". But the remaining empty strings are removed from the array.
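A quick way to check this is to print the array. The sketch below uses an illustrative class name (`TrailingEmptyStrings`) and helper (`show`); neither comes from the original answer.

```java
import java.util.Arrays;

class TrailingEmptyStrings {

    // Hypothetical helper: splits on "/" and renders the array as text.
    static String show(String input) {
        return Arrays.toString(input.split("/"));
    }

    public static void main(String[] args) {
        // The leading empty string is kept; the three trailing ones are removed.
        System.out.println(show("/abc//def/ghi///"));         // [, abc, , def, ghi]
        System.out.println("/abc//def/ghi///".split("/").length); // 5
    }
}
```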

In the posted case:

"I am preparing for OCPJP".split("\\S")

it's the same thing. Since non-space characters are delimiters, each letter is a delimiter, but the OCPJP letters essentially don't count, because those delimiters result in trailing empty strings that are then discarded. So, since there are 15 letters in "I am preparing for", they are treated as delimiting 16 substrings (the first is "" and the last is " ").
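The arithmetic can be verified by counting the delimiter matches directly. This is a sketch only: `CountingTokens` and `countMatches` are hypothetical names, and the negative limit is used here just to show what the array would look like if the trailing empty strings were kept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountingTokens {

    // Hypothetical helper: counts how many times the regex matches in the input.
    static int countMatches(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";

        // 20 non-whitespace characters, so 20 delimiters in the whole string.
        System.out.println(countMatches(test, "\\S")); // 20

        // Keeping trailing empty strings: 20 delimiters give 21 substrings.
        System.out.println(test.split("\\S", -1).length); // 21

        // Default behaviour: the 5 trailing empty strings from "OCPJP" are dropped.
        System.out.println(test.split("\\S").length); // 16

        // The 15 letters before "OCPJP" delimit the 16 surviving substrings.
        System.out.println(countMatches("I am preparing for", "\\S")); // 15
    }
}
```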

ajb
7

First, start with \s (lower case), which is the regular expression character class for whitespace: space ' ', tab '\t', newline '\n', carriage return '\r', vertical tab '\x0B', and form feed '\f'.

\S (upper case) is the opposite of this, so that would mean any non white space character.
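To see the two classes side by side, here is a small sketch applied to the string from the question. The class name `WhitespaceVsNonWhitespace` and the helper `show` are illustrative only.

```java
import java.util.Arrays;

class WhitespaceVsNonWhitespace {

    // Hypothetical helper: splits and renders the array as text.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";

        // \s matches whitespace, \S matches everything else.
        System.out.println(" ".matches("\\s")); // true
        System.out.println("a".matches("\\s")); // false
        System.out.println("a".matches("\\S")); // true

        // Splitting on whitespace gives the five words.
        System.out.println(show(test, "\\s")); // [I, am, preparing, for, OCPJP]

        // Splitting on non-whitespace treats every letter as a delimiter.
        System.out.println(test.split("\\S").length); // 16
    }
}
```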

So when you split the String "I am preparing for OCPJP" using \S, you are effectively splitting the string at every letter, which is how your token array ends up with a length of 16.

Now as for why these are empty.

Consider the String "Hello,World". If we were to split it on ",", we would end up with a String array of length 2 containing "Hello" and "World". Notice that the "," is not in either of the Strings; it has been removed.

The same thing has happened with the "I am preparing for OCPJP" String. It has been split, and the characters matched by your regex are not in any of the returned values. Because most of the letters in that String are followed by another letter, you end up with a load of Strings of length zero. Only the whitespace characters are preserved.
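The effect is easier to see on short inputs. The sketch below borrows the angle-bracket format from the question; `DelimitersAreConsumed`, `show`, and the input `"ab cd ef"` are made up for illustration.

```java
class DelimitersAreConsumed {

    // Hypothetical helper: wraps every token in <> so empty tokens are visible.
    static String show(String input, String regex) {
        StringBuilder sb = new StringBuilder();
        for (String token : input.split(regex)) {
            sb.append('<').append(token).append('>');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The comma is consumed by the split and appears in neither token.
        System.out.println(show("Hello,World", ",")); // <Hello><World>

        // Adjacent letters leave empty strings between them; only the
        // spaces survive, and the trailing empty strings are dropped.
        System.out.println(show("ab cd ef", "\\S")); // <><>< ><>< >
    }
}
```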

PeterK
  • 5
    The point of the question is: why 16 and not 21? Why is "OCPJP" not treated as a bunch of separators? There are 20 letters, hence 21 potential tokens, but the last ones are ignored... – Pablo Lozano Oct 09 '14 at 14:38
  • Fair point, missed that part of the question! Thanks for pointing that out and highlighting the documentation in your answer. – PeterK Oct 09 '14 at 14:49