
When I use the String.split() method, how come sometimes I get empty strings? For example, if I do:

"(something)".split("\\W+")  

Then the first element of the return value will be an empty string. Also, the example from the String.split() documentation, which splits the string "boo:and:foo", doesn't make sense to me either:

Regex    Result

  :      { "boo", "and", "foo" }
  o      { "b", "", ":and:f" }

How come there are no empty strings when ":" is used as the delimiter, but there are with "o"?

paxdiablo
b_pcakes
  • Because there are two consecutive `o`s, so when you split it you have an empty String between them. – Tunaki Sep 24 '15 at 08:31
  • The rule of thumb is that `split()` returns an array of `String`s that you can join back together to get the original string, if you know the delimiter. So if the original string starts with the delimiter, the result will start with an empty string; if the original has two consecutive delimiters, the result will contain an empty string there, and so on. – biziclop Sep 24 '15 at 08:34
  • See http://stackoverflow.com/questions/145509/why-does-abcd-startswith-return-true The point is that empty strings are located at the beginning, at the end and between each pair of symbols. `\W+` matches the non-word characters at the very start and very end of the string, so an empty string is left on the outer side of each match. – Wiktor Stribiżew Sep 24 '15 at 08:44
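
The rule of thumb from the comments can be checked with a small sketch. The class and the `roundTrip` helper are made up for illustration, and the helper assumes a literal (non-regex) delimiter. It also assumes a negative limit is passed when trailing empty strings need to survive, since the default limit of zero drops them.

```java
import java.util.regex.Pattern;

public class SplitJoinRoundTrip {
    // Hypothetical helper: split on a literal delimiter, then join the
    // pieces back together with that same delimiter.
    static String roundTrip(String s, String delimiter, int limit) {
        String[] parts = s.split(Pattern.quote(delimiter), limit);
        return String.join(delimiter, parts);
    }

    public static void main(String[] args) {
        String original = "boo:and:foo";
        // Negative limit keeps trailing empty strings, so the join is lossless.
        System.out.println(roundTrip(original, "o", -1)); // boo:and:foo
        // Limit 0 drops the trailing empty strings, so the final "oo" is lost.
        System.out.println(roundTrip(original, "o", 0));  // boo:and:f
    }
}
```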

2 Answers


With:

"(something)".split("\\W+")

split() assumes that each delimiter sits between two fields, so what you conceptually end up with is:

""   "something"   ""    <- fields
   (             )       <- delimiters

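The trailing empty field is then dropped, because split() without a limit argument discards trailing empty strings, but the leading one stays. You could fix that by trimming the string first to remove any leading or trailing delimiters, something like: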

"(something)".replaceAll("^\\W*","").replaceAll("\\W*$","").split("\\W+")
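
As a quick check of the above, here is a minimal sketch that compares the plain split with the trim-then-split approach. The class name and the `splitWords` helper are invented for this example; the regular expressions are the ones from the answer.

```java
import java.util.Arrays;

public class SplitTrimDemo {
    // Hypothetical helper: strip leading and trailing non-word characters,
    // then split on runs of non-word characters.
    static String[] splitWords(String input) {
        String trimmed = input.replaceAll("^\\W*", "").replaceAll("\\W*$", "");
        return trimmed.split("\\W+");
    }

    public static void main(String[] args) {
        // The leading "(" produces an empty first element; the empty
        // element after ")" is dropped by the default limit of zero.
        System.out.println(Arrays.toString("(something)".split("\\W+"))); // [, something]

        // After trimming, no leading delimiter is left.
        System.out.println(Arrays.toString(splitWords("(something)")));   // [something]
    }
}
```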

With something like:

"boo:and:foo".split("o", 0)

you'll get:

"b"   ""   ":and:f"   <- fields
    o    o            <- delimiters

because you have consecutive delimiters (which don't exist when the delimiter is ":"), and these are deemed to have an empty field between them.

And the reason you don't get trailing blank fields from the two o characters at the end of foo has to do with that limit of zero. In that case, trailing (but not leading) empty fields are removed.

If you also want to get rid of the empty fields in the middle, you can instead use "o+" as the delimiter, since that will greedily absorb consecutive o characters into a single delimiter. You can also use the replaceAll trick shown above to get rid of leading empty fields.
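
The effect of the limit and of the "o+" delimiter can be seen side by side in a short sketch. The class and method names are made up for illustration; the calls themselves are plain String.split().

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical wrappers around String.split(), named for what they show.
    static String[] dropTrailingEmpties(String s, String regex) {
        return s.split(regex, 0);   // same as s.split(regex)
    }

    static String[] keepTrailingEmpties(String s, String regex) {
        return s.split(regex, -1);  // negative limit keeps every field
    }

    public static void main(String[] args) {
        String s = "boo:and:foo";
        System.out.println(Arrays.toString(dropTrailingEmpties(s, "o")));  // [b, , :and:f]
        System.out.println(Arrays.toString(keepTrailingEmpties(s, "o")));  // [b, , :and:f, , ]
        // "o+" treats each run of o characters as a single delimiter.
        System.out.println(Arrays.toString(dropTrailingEmpties(s, "o+"))); // [b, :and:f]
    }
}
```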

paxdiablo

Actually, the reason isn't which delimiter you choose: in the latter case you have two o characters one right after the other. And what is between them? The empty string.

Maybe it's counterintuitive at first, and you might think it would be better to skip empty strings. But consider two very popular formats for storing data in text files: tab-separated values and comma-separated values.

Let's imagine that you want to store information about people in the format name,surname,age. For example, Peter,Green,12. But what if you want to store information about someone whose surname you don't know? It should look like Mike,,13. Then if you split by comma you get 'Mike', '', '13', and you know that the first element is the name, the second is an empty surname and the third is the age. But if you chose to skip empty strings, you would get 'Mike', '13', and you couldn't tell which field is missing.
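
The record example above can be sketched in Java as follows. The class and the `fields` helper are hypothetical, and the "Mike,Brown," record with a missing age is made up to show the trailing case. The sketch assumes a negative limit is wanted so that an empty last field is kept as well.

```java
import java.util.Arrays;

public class CsvFieldsDemo {
    // Hypothetical helper: split one record on commas and keep every
    // field, including empty ones at the end.
    static String[] fields(String record) {
        return record.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("Peter,Green,12"))); // [Peter, Green, 12]
        // The empty surname keeps its position, so age is still index 2.
        System.out.println(Arrays.toString(fields("Mike,,13")));       // [Mike, , 13]
        // Made-up record with an unknown age: the default limit drops the
        // trailing empty field, a negative limit keeps it.
        System.out.println("Mike,Brown,".split(",").length);           // 2
        System.out.println(fields("Mike,Brown,").length);              // 3
    }
}
```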

sbeliakov