A split happens at every place that matches the regex passed as the argument. You need to know that whenever a split happens, ONE thing becomes TWO things. Always. There are no exceptions.
You might doubt this because, for instance, "abc".split("c") returns an array with one element, ["ab"], but that is because this version of split also automatically removes trailing empty strings from the array before returning it.
In other words, "abc".split("c")
- creates the array ["ab", ""] (yes, there is an empty string, which is the result of splitting "abc" on c),
- removes the trailing empty strings,
- returns the array without those empty strings at the end, so the result is ["ab"].
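The hidden trailing empty string can be made visible with the two-argument overload split(regex, limit), where a negative limit tells split to keep trailing empty strings. A minimal sketch (the class and helper names are only for illustration):

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    // Hypothetical helper: split while keeping trailing empty strings.
    static String[] splitKeepingTrailing(String input, String regex) {
        // A negative limit disables the removal of trailing empty strings.
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        // Default behaviour: the trailing "" is removed.
        System.out.println(Arrays.toString("abc".split("c")));                 // [ab]
        // With a negative limit the trailing "" survives.
        System.out.println(Arrays.toString(splitKeepingTrailing("abc", "c"))); // [ab, ]
    }
}
```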
Another example would be splitting "abc" on "a". Since a is present at the start, you will get ["", "bc"].
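A quick way to check this case is the sketch below (the helper name splitOn is made up for this example). Only trailing empty strings are removed, so the leading one stays in the result:

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Hypothetical helper: plain one-argument split.
    static String[] splitOn(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // "a" matches at the very start, so the part before it is "".
        System.out.println(Arrays.toString(splitOn("abc", "a"))); // [, bc]
    }
}
```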
But splitting on an empty String is a little bit trickier, because there is an empty string before and after each character. I will mark them using a pipe |.
So the empty Strings in "abc" can be found at these positions: "|a|b|c|", which means that when you split "abc" on ""
- this method (at first) produces the array ["", "a", "b", "c", ""]
- and later removes the trailing empty strings.
That is why "abc".split("") returns the array ["", "a", "b", "c"] in Java 7 and earlier (this should answer your question 1).
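To see where the empty regex actually matches, independently of what split does with the pieces, we can ask a Matcher for every match position. This is only a sketch, and matchStarts is a helper invented for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchPositions {
    // Hypothetical helper: collect the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The empty regex matches before "a", between the letters, and after "c".
        System.out.println(matchStarts("", "abc")); // [0, 1, 2, 3]
    }
}
```

The four indexes correspond to the four pipes in "|a|b|c|".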
But what if we want to prevent the first empty string (the one at the start) from being matched by the split method? In other words, what if we don't want to split on
"|a|b|c|"
but only on
"a|b|c|"
We can do it in a few ways.
- We can try to create a regex that matches only those empty strings which have some character before them, like a| b| c|.
- We can also say that we want to split on empty strings that do not have the start of the string before them.
To create such regexes we will need look-around mechanisms.
- To say empty String, just use "".
- To say that something needs to have something else before it, we can use positive look-behind (?<=.).
If we combine the previous two points, "(?<=.)" and "", we get "(?<=.)"+"", which is simply "(?<=.)", so "abc".split("(?<=.)") should split only on those empty strings which are preceded by any character (represented in regex by the dot .).
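As a small sketch of this look-behind approach (the class and method names are placeholders I picked for the example):

```java
import java.util.Arrays;

class LookBehindSplit {
    // Hypothetical helper: split at every empty string that has a character in front of it.
    static String[] splitIntoChars(String input) {
        // (?<=.) is zero-width: it consumes nothing and only checks the previous character.
        return input.split("(?<=.)");
    }

    public static void main(String[] args) {
        // No match at index 0, because nothing precedes it, so there is no leading "".
        System.out.println(Arrays.toString(splitIntoChars("abc"))); // [a, b, c]
    }
}
```

Keep in mind that by default the dot does not match line terminators, so this sketch assumes input without line breaks.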
To say that something can't be at the start of the string, we can use negative look-behind (?<!...) and ^, which represents the start of the string. So (?<!^) represents the condition "has no beginning of string before it". That is why (?<!^) can't match this empty string
 ↓
"|a|b|c|"
since it has the start of the string before it.
Actually, there is also one special case, which is the main point of your question: (?!^), which is a negative look-ahead. This regex describes an empty string which does not have the start of the string after it. It is kind of unintuitive, because previously we assumed that the start of the string (represented by ^) is placed here
 ↓
"^|a|b|c|"
but now it looks like it is here:
  ↓
"|^a|b|c|"
So what is going on? How does it work?
As I said earlier, splitting on empty strings is tricky. To understand this, you need to look at the string without the marked empty strings, and you will see that the start of the string is here
 ↓
"^abc"
In other words, regex also considers the place right before the first character (in our case "a") as its start, so
  ↓
"|^a|b|c|"
also makes sense and is valid, which is why (?!^) is able to see this empty string
 ↓
"|^a|b|c|"
as being right before the start of the string and will not accept it as a valid place to split.
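Since ^ is itself zero-width, at index 0 it can be seen both by a look-behind and by a look-ahead, so both negative versions reject exactly that one position. The sketch below compares the two (matchStarts is a helper made up for this example, not part of any library):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartAnchorLookarounds {
    // Hypothetical helper: start indexes of all matches of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Both look-arounds skip index 0, the only place where ^ matches.
        System.out.println(matchStarts("(?<!^)", "abc")); // [1, 2, 3]
        System.out.println(matchStarts("(?!^)", "abc"));  // [1, 2, 3]
        // So both split "abc" without producing a leading empty string.
        System.out.println(Arrays.toString("abc".split("(?<!^)"))); // [a, b, c]
        System.out.println(Arrays.toString("abc".split("(?!^)")));  // [a, b, c]
    }
}
```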
ANYWAY, since this was causing confusion for developers who were not very familiar with regex, from Java 8 on we don't have to use the tricks with (?<=.) or (?<!^) or (?!^) to avoid creating an empty string at the beginning, because, as described in the question "Why in Java 8 split sometimes removes empty strings at start of result array?", split no longer produces an empty string at the start of the generated array as long as the match at the start is zero-length (as with the empty string), so you can now use "abc".split("") and get ["a", "b", "c"] as the result.
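A short sketch of the newer behaviour, assuming it runs on Java 8 or later (on older versions the result would start with an empty string; the class and method names are just for illustration):

```java
import java.util.Arrays;

class Java8EmptySplit {
    // Hypothetical helper: split a string into one-character strings.
    static String[] chars(String input) {
        // Assumes Java 8+: a zero-width match at index 0 adds no leading "".
        return input.split("");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chars("abc"))); // [a, b, c] on Java 8+
    }
}
```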