
I am trying to learn regular expressions and got confused. I saw the post "java split () method"
and have some questions regarding the 2nd answer, by Achintya Jha:

  1. Why does str2.split(""); give [, 1, 2, 3]?
  2. Does it detect "" at the start of the text, and if so, why doesn't it do the same at the end?
  3. What exactly does (?!^) mean?

If I am not wrong, a(?!b) returns a if a is not followed by b,
and ^ requires the regex to match at the beginning of the line. So does (?!^) take an empty string "",
with ^ finding a "" at the beginning of the line, and return "" only if this "" is not followed by ""?

xDeathwing

3 Answers


A split happens at every place that matches the regex passed as the argument. You need to know that when a split happens, ONE thing becomes TWO things. Always. There is no exception.

You may doubt this because, for instance, "abc".split("c") returns an array with one element, ["ab"], but that is because this version of split also automatically removes trailing empty strings from the array before returning it.

In other words, "abc".split("c")

  1. creates the array ["ab", ""] (yes, there is an empty string, which is the result of splitting "abc" on c),
  2. removes trailing empty strings,
  3. returns the array without those empty strings at the end, so the result is ["ab"].

Another example would be splitting "abc" on "a". Since a is at the start, you will get ["", "bc"].
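The removal of trailing empty strings can be observed by comparing split(regex) with the two-argument overload split(regex, limit): a negative limit keeps trailing empty strings. This is only a minimal sketch; the class name SplitTrailingDemo is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
public class SplitTrailingDemo {
    public static void main(String[] args) {
        // Default split: trailing empty strings are removed from the result.
        System.out.println(Arrays.toString("abc".split("c")));      // [ab]
        // A negative limit keeps them, showing the array before the cleanup.
        System.out.println(Arrays.toString("abc".split("c", -1)));  // [ab, ]
        // A leading empty string produced by a non-empty delimiter is kept.
        System.out.println(Arrays.toString("abc".split("a")));      // [, bc]
    }
}
```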

But splitting on the empty string is a little more tricky, because there is an empty string before and after each character. I will mark them using the pipe |.

So the empty strings in "abc" can be found at these positions: "|a|b|c|", which means that when you split "abc" on ""

  • this method (at first) produces the array ["", "a", "b", "c", ""]
  • and later removes trailing empty strings

That is why "abc".split("") returns the array ["", "a", "b", "c"] in Java 7 and earlier (this should answer your question 1).
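The positions marked with | can be listed directly with java.util.regex.Matcher, independently of how split post-processes its result. A small sketch; the class EmptyMatchPositions and the helper matchStarts are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class EmptyMatchPositions {
    // Returns the start index of every match of regex in input.
    public static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The empty regex matches before each character and at the end: |a|b|c|
        System.out.println(matchStarts("", "abc")); // [0, 1, 2, 3]
    }
}
```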

But what if we want to prevent the first empty string (the one at the start) from being matched by the split method? In other words, what if we don't want to split on

"|a|b|c|"

but only on

 "a|b|c|"

We can do it in a few ways.

  1. We can create a regex that matches only those empty strings that have some character before them, like a| b| c|.
  2. We can also say that we want to split on empty strings that do not have the start of the string before them.

To create such regexes we need the look-around mechanisms.

  1. For the first approach:

    • To say "empty string", just use "".
    • To say that something needs to have something else before it, we can use a positive look-behind: (?<=.).

    If we combine the previous two points, "(?<=.)" and "", we get "(?<=.)"+"", which is simply "(?<=.)", so "abc".split("(?<=.)") should split only on those empty strings that are preceded by any character (represented in regex by the dot .).

  2. To say that something can't be at the start of the string, we can use a negative look-behind (?<!...) and ^, which represents the start of the string. So (?<!^) represents the condition "has no beginning of the string before it". That is why (?<!^) can't match this empty string

     ↓  
    "|a|b|c|"
    

since it has the start of the string before it.

Actually, there is also one special case, which is the main point of your question: (?!^), which is a negative look-ahead. This regex describes an empty string that does not have the start of the string after it. This is somewhat unintuitive, because previously we assumed that the start of the string (represented by ^) is placed here

 ↓
"^|a|b|c|"

but now it looks like it is here:

  ↓
"|^a|b|c|"

So what is going on? How does it work?
As I said earlier, splitting on empty strings is tricky. To understand this, you need to look at the string without the marked empty strings, and you will see that the start of the string is here

 ↓
"^abc"

In other words, the regex also considers the place right before the first character (in our case "a") as the start of the string, so

  ↓
"|^a|b|c|"

also makes sense and is valid, which is why (?!^) is able to see this empty string

 ↓
"|^a|b|c|"

as being right before the start of the string, and will not accept it as a valid place to split.
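The three look-around regexes discussed above can be compared side by side. This is only an illustrative sketch (the class name LookaroundSplitDemo is made up); each regex rejects the empty string at index 0, so no leading empty element appears in any Java version.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
public class LookaroundSplitDemo {
    public static void main(String[] args) {
        String[] regexes = {
            "(?<=.)",  // empty string preceded by any character
            "(?<!^)",  // empty string not preceded by the start of the string
            "(?!^)"    // empty string not followed by the start of the string
        };
        for (String regex : regexes) {
            // Each line prints the regex followed by [a, b, c]
            System.out.println(regex + " -> " + Arrays.toString("abc".split(regex)));
        }
    }
}
```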


Anyway, since this was causing confusion for developers who were not very familiar with regex, from Java 8 on we no longer need the tricks with (?<=.), (?<!^) or (?!^) to avoid creating an empty string at the beginning, because, as described in this question

Why in Java 8 split sometimes removes empty strings at start of result array?

it automatically removes the empty string at the start of the generated array as long as the regex used in split matches a zero-length string there (like the empty string), so you can now use "abc".split("") and get ["a", "b", "c"] as the result.
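A quick way to check this, assuming the code runs on Java 8 or later (on Java 7 the first line would print [, a, b, c]); the class name Java8SplitDemo is only illustrative:

```java
import java.util.Arrays;

// Hypothetical demo class; expected output assumes a Java 8+ runtime.
public class Java8SplitDemo {
    public static void main(String[] args) {
        // Zero-width match at index 0: no leading empty string on Java 8+.
        System.out.println(Arrays.toString("abc".split("")));   // [a, b, c]
        // Non-zero-width match at index 0: the leading empty string is still kept.
        System.out.println(Arrays.toString("abc".split("a")));  // [, bc]
    }
}
```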

Pshemo

(1) Why does str2.split(""); give [, 1, 2, 3]? (2) Does it detect "" at the start of the text, and if so, why doesn't it do the same at the end?

When you split on an empty string, the result contains the empty string as its first item. If the delimiter does not occur in the string you are searching, you get an array of size 1 that holds the original string, even if it is empty.


(3) What exactly does (?!^) mean?

This is a negative lookahead assertion, which asserts that the current position is not at (right before) the start of the string.

(?!   # look ahead to see if there is not:
  ^   #   the beginning of the string
)     # end of look-ahead

And you are correct on how Negative Lookahead works.

a(?!b) # matches a when not followed by b
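As a small illustration of a(?!b), the following sketch replaces every a that is not followed by b. The input string and the class name NegativeLookaheadDemo are made up for this example.

```java
// Hypothetical demo class, not part of the original answer.
public class NegativeLookaheadDemo {
    public static void main(String[] args) {
        // Only the a's in "ac" and "ad" match; the a in "ab" is followed by b.
        String result = "ab ac ad".replaceAll("a(?!b)", "X");
        System.out.println(result); // ab Xc Xd
    }
}
```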
hwnd

The regex:

(?!^)

Is a negative look ahead for start of input. It means "not positioned before the start of input".

Because the otherwise blank regex matches before the start, this assertion stops it splitting there, so it only splits between characters (not between start and the first character).

Another regex that achieves the same thing would be:

(?<=.)

Which is a look-behind for any character, i.e. "after any character", which I find clearer.

Bohemian