After reading through "How does \G work in .split?", I quickly set up a Delphi program to check how PCRE handles this case. Interestingly, the results were not the same as in the Java case:

program Project1;
{$APPTYPE CONSOLE}
uses
  System.RegularExpressions;
var
  SArr: TArray<string>;
  S: string;
begin
  SArr := TRegex.Split('abcdefghij', '(?<=\G..)',[]);
  for S in SArr do
  begin
    WriteLn(S);
  end;
  ReadLn;
end.

Outputs:

ab
cde
fgh
ij

Why does the PCRE result differ from the Java one? How is this behaviour to be explained?

To make sure this isn't a Delphi error, I tested it on regex101, and the matching behaviour seems to be the same: https://regex101.com/r/GE6eRI/1
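For reference, the Java side of the comparison can be reproduced with a small sketch like the one below. The class and method names (`SplitWithG`, `splitPairs`) are made up for illustration; only `String.split` and the pattern itself come from the linked question.

```java
import java.util.Arrays;

public class SplitWithG {
    // Hypothetical helper: splits after every two characters using the
    // lookbehind pattern from the question.
    static String[] splitPairs(String input) {
        return input.split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        // Prints [ab, cd, ef, gh, ij]
        System.out.println(Arrays.toString(splitPairs("abcdefghij")));
    }
}
```

Here every piece is two characters long, in contrast to the `ab`, `cde`, `fgh`, `ij` output from PCRE shown above.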

Sebastian Proske
  • From my experience, `'(?<=\G..)'` only works in Java regex, which is why I said I would not rely on this kind of pattern. There are ways to work around it, fortunately (e.g. matching `/../g`, etc.). – Wiktor Stribiżew May 16 '18 at 08:53
  • It seems the first call of `\G` advances the position at which it should match next; otherwise, affecting this pointer with `\K` shouldn't change anything: [`(?<=\K\G..)`](https://regex101.com/r/vMKLdf/1) – revo May 16 '18 at 09:34
  • I encourage you to add `java` tag since it is an inter-flavor comparison. – revo May 16 '18 at 09:38
  • @revo Added java and regex-lookarounds. I wasn't sure about either the java tag (as my question doesn't use any Java code) or the delphi tag (as it's just used to show the PCRE behaviour), but it seems to make sense to include both. – Sebastian Proske May 16 '18 at 09:42

1 Answer

I'd like to quote from Alan Moore:

This trick will work (for example) in Java, Perl, .NET and JGSoft, but not in PHP (PCRE), Ruby 1.9+ or TextMate (both Oniguruma)

A quote from PCRE docs that I think applies here:

Note, however, that PCRE's interpretation of \G, as the start of the current match, is subtly different from Perl's, which defines it as the end of the previous match. In Perl, these can be different when the previously matched string was empty. Because PCRE does just one match at a time, it cannot reproduce this behaviour.

It seems that a \G token inside a lookbehind in PCRE runs into the zero-length match problem: when \G matches inside a lookbehind, the engine advances one character. Consider the following regex:

(?<=\G)

and input string:

abcd

With the global modifier on, it matches 5 positions (see live demo), whereas we would expect it to match one and only one position, as Java does. A workaround in PHP that produces the same result as Java is to use \K along with a capturing group:

(?<=\K\G(..))

The same task could also be done with:

\G..\K
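
For comparison, here is a minimal Java sketch of the two points above. It is an illustration only: the class and method names (`GAnchorDemo`, `countMatches`, `chunkPairs`) are hypothetical. Java has no `\K`, so the second helper swaps in a different technique and simply collects consecutive `\G.{1,2}` matches instead of splitting. That pattern is an assumption chosen so that a trailing single character is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GAnchorDemo {
    // Counts how many times the regex matches when find() is called repeatedly,
    // roughly what a "global" match does in other flavors.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Collects contiguous chunks of up to two characters by matching
    // rather than splitting; \G keeps each match anchored to the previous one.
    static List<String> chunkPairs(String input) {
        Matcher m = Pattern.compile("\\G.{1,2}").matcher(input);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("(?<=\\G)", "abcd")); // prints 1
        System.out.println(chunkPairs("abcdefghij"));         // prints [ab, cd, ef, gh, ij]
        System.out.println(chunkPairs("abcde"));              // prints [ab, cd, e]
    }
}
```

Matching the chunks directly avoids relying on how a given engine treats `\G` inside a lookbehind, which is the part that differs between flavors.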
revo