I'm practicing reading input and then tokenizing it. For example, if I have [882,337] I want to just get the numbers 882 and 337. I tried using the following code:

    String test = "[882,337]";
    String[] tokens = test.split("\\[|\\]|,");
    System.out.println(tokens[0]);
    System.out.println(tokens[1]);
    System.out.println(tokens[2]);

It kind of works, the output is: (blank line) 882 337

What I don't understand is why tokens[0] is empty. I would expect there to be only two tokens, where tokens[0] = 882 and tokens[1] = 337.

I checked out some links but didn't find the answer.

Thanks for the help!

SuperCow

4 Answers

6

Split splits the given String. If you split "[882,337]" on "[" or "," or "]" then you actually have:

  • nothing
  • 882
  • 337
  • nothing

But, as you have called String.split(delimiter), this calls String.split(delimiter, limit) with a limit of zero.

From the documentation:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

(emphasis mine)

So in this configuration the trailing empty string is discarded, but the leading one is kept. You are therefore left with exactly the three elements you see: "", "882" and "337".
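To see the effect of the limit directly, here is a minimal sketch that splits the string from the question twice, once with the default limit of zero and once with a negative limit. The class and method names (SplitLimitDemo, splitDefault, splitKeepAll) are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same delimiter regex as in the question: '[' or ']' or ','.
    static final String DELIMITERS = "\\[|\\]|,";

    // Limit 0 (what split(regex) uses): trailing empty strings are discarded.
    static String[] splitDefault(String input) {
        return input.split(DELIMITERS);
    }

    // Negative limit: every piece is kept, including trailing empty strings.
    static String[] splitKeepAll(String input) {
        return input.split(DELIMITERS, -1);
    }

    public static void main(String[] args) {
        String test = "[882,337]";
        System.out.println(Arrays.toString(splitDefault(test))); // [, 882, 337]
        System.out.println(Arrays.toString(splitKeepAll(test))); // [, 882, 337, ]
    }
}
```

The leading empty string survives in both cases; only the trailing one depends on the limit.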


Usually, to tokenize something like this, one would go for a combination of replaceAll and split:

final String[] tokens = input.replaceAll("^\\[|\\]$", "").split(",");

This will first strip off the start (^[) and end (]$) brackets and then split on ,. This way you don't have to have somewhat obtuse program logic where you start looping from an arbitrary index.
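As a runnable sketch of this approach, the snippet below strips the brackets with the two-argument form of replaceAll (regex, replacement) and then splits on commas. The class name BracketStripDemo and the helper tokenize are hypothetical names chosen for this example, and the input is assumed to be a well-formed, non-empty bracketed list.

```java
import java.util.Arrays;

class BracketStripDemo {
    // Remove a leading '[' and a trailing ']', then split on commas.
    static String[] tokenize(String input) {
        return input.replaceAll("^\\[|\\]$", "").split(",");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("[882,337]");
        System.out.println(tokens.length);           // 2
        System.out.println(Arrays.toString(tokens)); // [882, 337]
    }
}
```

Because the brackets are gone before the split, there is no empty element at index 0 to skip.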


As an alternative, for more complex tokenizations, one can use Pattern - might be overkill here, but worth bearing in mind before you get into writing multiple replaceAll chains.

First we need to define, as a regex, the tokens we want (rather than those we're splitting on). In this case it's simple: a token is just a run of digits, so \d+.

So, in order to extract all digit-only values (no thousands/decimal separators) from an arbitrary String, one would do the following:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final List<Integer> tokens = new ArrayList<>();    // holds the parsed tokens
final Pattern pattern = Pattern.compile("\\d++");  // one or more digits, possessive
final Matcher matcher = pattern.matcher(input);    // matcher over the input String

while (matcher.find()) {                           // for each matched token
    tokens.add(Integer.parseInt(matcher.group())); // parse as an int and store
}

N.B.: I have used a possessive quantifier (++) in the regex for efficiency.

So, you see, the above code is somewhat more complex than the simple replaceAll().split(), but it is much more extensible. You can use an arbitrarily complex regex to tokenize almost any input.

Boris the Spider
  • Thank you for your explanation. Makes perfect sense. When doing something with my array after the String has been tokenized would I just start at index [1] in order to ignore the preceding empty string at [0], or is there a better way to handle it? – SuperCow Jan 29 '16 at 05:16
  • @SuperCow see my edit. Usually one would `replaceAll` the tokens that you didn't want and `split` on the delimiters. Splitting on start and end tokens is, indeed, a little messy. – Boris the Spider Jan 29 '16 at 09:05
3

The symbols where the string is split are here:

String test = "[882,337]";
               ^   ^   ^

Because the first char matches your delimiter, everything to the left of it becomes the first result. There is nothing to the left of the first char, so that result is the empty string.

One could expect the same behaviour for the end, since the last symbol also matches the delimiter. But:

Trailing empty strings are therefore not included in the resulting array.

See Javadoc.

exception1
  • Ah, interesting. Thank you for explaining that. When dealing with such a case would I just start from index 1 when doing something with my array in order to ignore the preceding empty string or is there a more elegant way of doing it? – SuperCow Jan 29 '16 at 05:14
  • @SuperCow If you definitely know that your delimiter matches your string at position zero (resulting in an empty string at array index 0), starting with index 1 is just fine. – exception1 Jan 29 '16 at 23:37
2

That's because each delimiter has a "before" and "after" result, even if it is empty. Consider

882,337

You expect that to produce two results. Similarly, you expect

882,337,

to produce three, with the last one being empty (assuming your limit is big enough, or assuming you're using almost any other language / implementation of split()). Extending that logically,

,882,337,

must produce four, with the first and last results being empty. This is exactly the case you have, except you have multiple delimiters.
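The progression above can be checked with a short sketch. It counts the pieces for each of the three strings, once with a negative limit (nothing discarded) and once with Java's default limit of zero. The class name DelimiterCountDemo and its two helper methods are illustrative names, not taken from the answer.

```java
class DelimiterCountDemo {
    // Pieces when nothing is discarded (negative limit).
    static int piecesKeepAll(String input) {
        return input.split(",", -1).length;
    }

    // Pieces with Java's default limit of 0 (trailing empty strings dropped).
    static int piecesDefault(String input) {
        return input.split(",").length;
    }

    public static void main(String[] args) {
        String[] inputs = { "882,337", "882,337,", ",882,337," };
        for (String s : inputs) {
            System.out.println(s + " -> keep all: " + piecesKeepAll(s)
                    + ", default: " + piecesDefault(s));
        }
        // keep all: 2, 3, 4   default: 2, 2, 3
    }
}
```

With the default limit the trailing empty result disappears, which is why the question's array has three elements rather than four.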

Paul Hicks
2

Splitting creates two (or more) things from one thing. For instance if you split a,b by , you will get a and b.

But in case of ",b" you will get "" and "b". You can think of it this way: "" exists at start, end and even in-between all characters of string:

""+","+"b" -> ",b" so if we split on this "," we are getting left and right part: "" and "b"


Something similar happens with "a,": at first the result array is ["a", ""], but here the split method removes trailing empty strings and returns only ["a"] (you can turn off this cleanup by using split(",", -1)).

So in case of

String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");

you are splitting:

     ""+"["+"882"+","+"337"+"]"+""
here:    ^         ^         ^

which at first creates the array ["", "882", "337", ""], but then the trailing empty string is removed and you finally receive:

["", "882", "337"]

Only case where empty string is removed from start of result array is when

Pshemo