
I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which means completing a certain challenge in as few bytes as possible. In one of my answers I had the following piece of code:

for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";"))

This loops over the 2-char Strings after the input has been converted into a String array with .split. Someone suggested I could golf it to this instead to save 4 bytes:

for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))

The functionality is still the same. It loops over the 2-char Strings.

However, neither of us was 100% sure how this works, hence this question.


What I know:

I know .split("(?<= ... )") is used to split, but keep the trailing delimiter.
There is also a way to keep a leading delimiter, or delimiter as separated item:

"a;b;c;d".split("(?<=;)")            // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)")             // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))")    // Results in ["a", ";", "b", ";", "c", ";", "d"]

I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.

int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 5

count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 3

But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?

Here is a "Try it online" link for all code snippets described above to see them in action.

Kevin Cruijssen
  • Is `\G` functionality clear to you or it is not either? – revo May 16 '18 at 08:00
  • 1
    _I know `\G` is used to stop after a non-match_ That's not quite correct, `\G` is used to indicate the position where the last match ended (or the start of the string for the first run) – Sebastian Proske May 16 '18 at 08:01
  • 1
    Probably [this answer](https://stackoverflow.com/a/29604177/3832970) explains what you want. – Wiktor Stribiżew May 16 '18 at 08:01
  • @revo I would say partially clear. I know how it works in the example I gave with the `match,match,...`, but I have no idea how it works inside the `.split`. it's also the first time I see `\G` today, so it's still fairly new for me, and probably not 100% clear. – Kevin Cruijssen May 16 '18 at 08:02
  • @WiktorStribiżew Hmm, so if I understand correctly, `\G` is normally a zero-length match, where in Java it indicates the position where the last match ended as stated by _SebastianProske_ above. But in Java inside the look-behind it matches the entire match instead of just the end, which causes these 'rules' to conflict with each other. I'm still a bit unsure why `.split("(?=\\G..)")` doesn't work though. Although I'll have to admit look-aheads/arounds/backs aren't really my expertise in general.. – Kevin Cruijssen May 16 '18 at 08:16
  • 1
    @KevinCruijssen The problem is that this `\G` in lookbehind case belongs to the "undefined" behaviors. It is a "miracle" it works like that in a Java regex. Thus, I would not rely on it since it does look like a bug (lookbehinds are to be non-consuming) and this can be fixed in any future Java versions. Same as `*` and `+` quantifiers in Java lookbehinds, that worked regardless of Java regex specs. – Wiktor Stribiżew May 16 '18 at 08:18
  • @WiktorStribiżew - I think this is reliable, because the lookbehind *is* non-consuming. The reason we progress through the string is that `split` does that, not the regex. – T.J. Crowder May 16 '18 at 08:37
  • 2
    @T.J.Crowder I did not delve into the code, probably you are right. However, `split("(?=\\G..)")` is supposed to work, it is again some Java quirk, compare [Java](https://ideone.com/bbyo2Q) vs. [PHP](https://ideone.com/9Isf2c), for instance. – Wiktor Stribiżew May 16 '18 at 08:47
  • @WiktorStribiżew - Why would a lookahead work? The example you gave there doesn't split the string at all in Java (as I would expect). The PHP behavior seems bizarre. (It's even more bizarre with a lookbehind: https://ideone.com/S7UAmV) – T.J. Crowder May 16 '18 at 08:56
  • 2
    @T.J.Crowder If only I knew. Look, Python `re` does not split upon zero-length matches at all, [C#](http://rextester.com/MKU72549) split `12345678` with `(?=\G..)` into 2 items, `['','12345678']`, [Ruby](http://rextester.com/MIOUQ58465) splits the same way as PHP. All that indicates `\G` behavior in lookarounds is undefined, each language is free to interpret it in its own way, and I do not rely on such patterns. I am not against using them, but only by myself as "I know what I am doing". – Wiktor Stribiżew May 16 '18 at 09:19
  • @WiktorStribiżew - Sorry, I meant "I think this *(the lookbehind version)* is reliable" *in Java*. Seems to work [in C#](http://rextester.com/DDIT49033) and [in Ruby](http://rextester.com/LOOZ4009) too. Not sure why you're focussing on lookaheads? But your point of caution is well-taken, particularly in light of [PHP's behavior with the lookbehind](http://rextester.com/NLOI46412). – T.J. Crowder May 16 '18 at 09:44
  • 1
    Note: I'd expect `\G`, `(?=\G)`, and `(?<=\G)` to all behave the same, because they are all zero-width, and this is the case also in Java: [TIO](https://tio.run/##y0osS9TNSsn@/z85J7G4WME3MTOvmktBoaA0KSczWaG4JLEESJXlZ6Yo5AKlNIJLijLz0qNjFRI1QcoUFIIri0tSc/XyS0v0CoBSJTl5GllAM/VKSzJz9ByLihIri/VK8iHaNJQcTZwsnENdfV3DXY3dzdwtPQ19zP2i/A2CjIJNlfSKC3IySzSUNOxtbGNi3DWj43xjlTQ1Na2B1tDOLlSraGUP0BJkO2q5av//BwA). It looks like Java is removing all empty strings from the split result, so this is a little difficult to debug, but in my example you can see all three options are the same. – Kobi May 16 '18 at 12:38
  • Also, a little related: [Example of “use \G in negative variable-length lookbehinds to limit how far back the lookbehind goes”](https://stackoverflow.com/q/27562751/7586) – Kobi May 16 '18 at 12:49

2 Answers


how does .split("(?<=\\G..)") work

(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.

So let's walk through ABCDEF:

  1. \G matches beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: That is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
  2. \G marks the location just after AB so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
  3. Same again: \G marks the location just after CD so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
  4. Create an array with all of the resulting pieces except the empty one at the end (because this is split with an implicit limit of 0, which discards trailing empty strings).

Result { "AB", "CD", "EF" }.
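The walkthrough above can be checked with a minimal sketch along these lines. The class name `SplitWalkthrough` and the helper `matchPositions` are hypothetical names introduced only for illustration; the sketch lists the positions where the pattern matches and then compares `split` with and without a negative limit argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWalkthrough {
    // Hypothetical helper: collects the index of every (zero-width) match,
    // i.e. every position where split would cut the input.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String regex = "(?<=\\G..)";
        // Cut points after AB, after CD, and after EF
        System.out.println(matchPositions(regex, "ABCDEF"));            // [2, 4, 6]
        // Default split drops the trailing empty string
        System.out.println(Arrays.toString("ABCDEF".split(regex)));     // [AB, CD, EF]
        // A negative limit keeps the trailing empty string
        System.out.println(Arrays.toString("ABCDEF".split(regex, -1))); // [AB, CD, EF, ]
    }
}
```

The last line shows the empty piece created by the match at end-of-input, which the default call silently removes.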

And why does .split("(?=\\G..)") not work?

Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
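A minimal sketch of the lookahead case, assuming Java 8 or later (where `split` ignores a zero-width match at index 0). `LookaheadDemo` and `countMatches` are hypothetical names used only for illustration. It suggests that the lookahead matches exactly once, at the very start of the input, and never again, so nothing is split.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Hypothetical helper: counts how many times the pattern matches.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String regex = "(?=\\G..)";
        // Only the zero-width match at index 0 succeeds; afterwards \G stays
        // at index 0 while the search position has already moved past it.
        System.out.println(countMatches(regex, "ABCDEF"));          // 1
        // split ignores the zero-width match at index 0, so the input is returned whole
        System.out.println(Arrays.toString("ABCDEF".split(regex))); // [ABCDEF]
    }
}
```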

T.J. Crowder
  • 1
    Please check https://stackoverflow.com/a/29604177/3832970 - the point is that the lookbehind with `\G` actually *consumes* text (or at least, moves the index) in Java (although it is a non-consuming pattern). – Wiktor Stribiżew May 16 '18 at 08:05
  • @WiktorStribiżew - Read through it, but didn't find it particularly clear. I'd love to see an answer with your explanation of this. You are the master, after all. – T.J. Crowder May 16 '18 at 08:08
  • 2
    @WiktorStribiżew - The lookbehind with `\G` isn't *consuming* any text. Just marking where splits should occur. – T.J. Crowder May 16 '18 at 08:23
  • Yeah, consuming is a complex action, `\G` here just moves the index. – Wiktor Stribiżew May 16 '18 at 08:26

First off, the definition of \G: it's an anchor that matches at the beginning of the string or at the end of the previous match. It's a position. It neither consumes a character nor changes the cursor position. Alan Moore wrote in an earlier answer that the behavior of \G inside lookbehinds is engine-specific. This pattern splits at the desired length in Java but doesn't produce the same result in PCRE.

So how does \G in (?<=\G..) work? Look at below step-by-step demonstration of where dot and \G match:

 ↓A4
\G..↓B8
   \G..↓CU
      \G..
       .
       .

\G matches the beginning of the input string, then the dots match A and 4 in order, so the first split falls between 4 and B. The engine continues traversing and next stops right between 8 and C. Here the lookbehind matches:

A   4   B  8
     \G .  . (?<=\G..)

\G matches where the previous dots stopped matching, i.e. the position right after 4 and before B. This process continues to the end of the input string. It splits the string into chunks of 2 units of data (safely a character here). It shouldn't be expected to work on multi-line input strings: there it either splits only partially, since the dot . doesn't match a newline character, or doesn't split at all, since \G doesn't match the start of a line (only the start of the input string).
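The newline caveat can be illustrated with a small sketch. `NewlineSplitDemo` and `chunk` are hypothetical names, and the `(?s)` (DOTALL) flag is an addition of this sketch rather than part of the pattern discussed above; it is shown only as one possible way to let the dot match line terminators.

```java
import java.util.Arrays;

public class NewlineSplitDemo {
    // Hypothetical helper: splits into 2-char chunks, optionally with DOTALL enabled.
    static String[] chunk(String input, boolean dotAll) {
        String regex = (dotAll ? "(?s)" : "") + "(?<=\\G..)";
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "AB\nCDEF";
        // Default: '.' does not match '\n', so the \G chain breaks after "AB"
        // and the rest of the input stays in one piece.
        System.out.println(chunk(input, false).length); // 2
        // With (?s) the dot also matches '\n', so chunking continues.
        System.out.println(chunk(input, true).length);  // 4
    }
}
```

With DOTALL the newline simply counts as one of the two characters in a chunk, which may or may not be what a caller wants.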

And why does .split("(?=\\G..)") not work?

Because of a lookahead's nature, which is to look forward, there is no possibility for it to meet the position where the previous match ended. It just continues walking until the end.

revo
  • 1
    Great answer as well! +1. I will leave the accepted mark at _@T.J.Crowder_'s answer though, since he was faster. Both his and yours are very good answers however, and answered both my questions.. – Kevin Cruijssen May 16 '18 at 08:58
  • 1
    My pleasure. I didn't answer to change the acceptance mark location just tried to shine some more light on the problem. – revo May 16 '18 at 09:09