
I'm having difficulty understanding how the \G anchor works in the PHP flavor of regular expressions.

I'm inclined to think (even though I may be wrong) that \G is used instead of ^ in situations when multiple matches of the same string are taking place.

Could someone please show an example of how \G should be used, and explain how and why it works?

koopajah
Dimitri Vorontzov
  • Take a look at this answer for a real example: http://stackoverflow.com/a/2248130/1606729 – koopajah Feb 15 '13 at 15:35
  • @koopajah - thank you. Unfortunately, it's not a proper example. I'm asking about using \G anchor; the example you linked to is using \g for backreference. – Dimitri Vorontzov Feb 15 '13 at 15:38
  • Thanks again, @koopajah. The new example indeed uses \G, but from that example I still can't understand anything about how and why \G should be used. The only thing I see is that \G is used there, but why it's used, in what other situations it should be used, and so on - I do not understand that. More examples, please? – Dimitri Vorontzov Feb 15 '13 at 15:42
  • http://stackoverflow.com/a/14294280/1400768 http://stackoverflow.com/a/14465042/1400768 It is not going to be easy to read, though... – nhahtdh Feb 15 '13 at 15:45
  • Thank you @nhahtdh – first example is in Java, I'm asking about PHP. Second example is general regex, but again, I'm specifically interested in PHP flavor. One way or another, I'd be grateful for the kind of simple example that would clarify the matter for me. These ones made me even more confused. ;-) – Dimitri Vorontzov Feb 15 '13 at 15:50
  • @DimitriVorontzov: They work the same way in both languages. – nhahtdh Feb 15 '13 at 15:51
  • @nhahtdh - great, thanks for clarifying that. But how do they work and what makes them work? If you have good grasp of the subject, may I please ask you to write a simple example (some simple regex pattern applied to some really basic string, using \G) - and comment it? – Dimitri Vorontzov Feb 15 '13 at 15:52
  • See also: [Purpose of the \G anchor in regular expressions](http://stackoverflow.com/questions/3427825) – hippietrail Jul 29 '14 at 14:03

2 Answers


UPDATE

\G forces the pattern to return only matches that form a continuous chain: the first match must begin at the start of the subject (or at the given offset), and each subsequent match must begin exactly where the previous one ended. Once the chain is broken, matching stops.

<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

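// Outputs "match" 5 times: the engine skips "not-" and still finds "match," inside "not-match,"
// (the final "match" has no trailing comma, so it is not matched)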
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}

echo '<br />';

$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

// Outputs "match" only 4 times: at "not-match," no match can start where the previous one ended, so the chain is broken
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>

This is straight from the docs

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are

 \G
    first matching position in subject

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

http://www.php.net/manual/en/regexp.reference.escape.php

You will have to scroll down that page a bit but there it is.
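
To make the offset wording in that excerpt concrete, here is a minimal sketch (the subject string and variable names are my own, not from the docs). It passes the optional offset argument to preg_match() and compares \G, \A and an unanchored pattern:

```php
<?php
$subject = 'abc123';

// Start at offset 3 (the "1"). \G is true at the offset, so this matches "123".
$withG = preg_match('/\G\d+/', $subject, $m1, 0, 3);        // 1, $m1[0] === "123"

// \A is true only at the real start of the subject, so this fails
// even though matching starts at offset 3.
$withA = preg_match('/\A\d+/', $subject, $m2, 0, 3);        // 0

// Start at offset 1 (the "b"). \G pins the match to the offset,
// so the engine may not scan ahead to the digits.
$pinned = preg_match('/\G\d+/', $subject, $m3, 0, 1);       // 0

// Without any anchor the engine scans forward from the offset and finds "123".
$unanchored = preg_match('/\d+/', $subject, $m4, 0, 1);     // 1, $m4[0] === "123"
```

So with a non-zero offset, \G behaves like "start of this attempt", while \A still means "start of the whole string".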

There is a really good example in Ruby, but it works the same way in PHP.

How the Anchor \z and \G works in Ruby?

Jared
  • Thank you @Jrod, it's a step in the right direction for me, and I appreciate your posting the link to the docs. Unfortunately, being relatively new to PHP and programming in general, I'm not grasping the actual, practical meaning of that thing from the documents at all. That's why I'm asking for an example. – Dimitri Vorontzov Feb 15 '13 at 15:56
  • @DimitriVorontzov I added a simple example. I hope that makes it clearer. – Jared Feb 15 '13 at 17:03

\G matches at the match boundary, which is either the beginning of the string or the position where the previous match ended.

It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.

Example problem

Let us take the example of tokenizing this input:

input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'

Into these tokens (I use ~ to denote end of string):

input~
some input in quote~
more~
input~
'escaped quote'~
lots@_$of_fun~
 ' \  ~
crazy~
stuff~

The string consists of a mix of:

  • A single-quoted string, in which \ and ' can be escaped and spaces are preserved. An empty string can be specified with a single-quoted string.
  • OR an unquoted string, which consists of a sequence of non-whitespace characters and does not contain \ or '.
  • A space between two unquoted strings delimits them. A space is not needed as a delimiter in the other cases.

For the sake of simplicity, let us assume the input does not contain newlines (in a real case, you need to consider them). Handling them would add to the complexity of the regex without demonstrating the point.

The raw regex for a single-quoted string is '(?:[^\\']|\\[\\'])*+'
And the raw regex for an unquoted string is [^\s'\\]++
You don't need to worry too much about these two pieces of regex, though.

The solution below uses \G to guarantee that every match starts exactly where the previous one ended, so when the engine finally fails to find a match, every character from the beginning of the string up to the end of the last match has been consumed by a valid token. Since it cannot skip characters, the engine stops matching when neither token specification matches, rather than grabbing random stuff from the rest of the string.

Construction

At the first step of construction, we can put together this regex:

\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))

Or simply put (this is not regex - just to make it easier to read):

\G(Singly_quote_regex|Unquoted_regex)

This will match only the first token: on the second attempt, \G pins the match to the space before 'some input..., and neither alternative can match a space.
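
As a rough PHP check of this first step (a sketch only; the shortened input and variable names are my own, and the pattern is kept in a nowdoc so the backslashes need no extra escaping):

```php
<?php
// First-step regex from above, wrapped in / delimiters.
$pattern = <<<'REGEX'
/\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))/
REGEX;

// Hypothetical, shortened input.
$input = "input 'some input in quote' more";

$count = preg_match_all($pattern, $input, $matches);

// Only the first token is found: the second attempt is pinned by \G
// to the space after "input", where neither alternative can match.
echo $count, "\n";          // 1
echo $matches[0][0], "\n";  // input
```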


We just need to add a bit to allow for 0 or more space, so that in the subsequent match, the space at the position left off by the last match is consumed:

\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))

The regex above will now correctly identify the tokens, as seen here.
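
For readers who want to try this in PHP, here is a minimal sketch using preg_match_all() with PREG_SET_ORDER (the variable names are my own). The quoted tokens are collected raw, with their escape sequences still in place, since unescaping is a separate step:

```php
<?php
$pattern = <<<'REGEX'
/\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))/
REGEX;

// The example input from above, kept in a nowdoc so nothing is interpolated.
$input = <<<'TXT'
input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'
TXT;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$tokens = [];
foreach ($matches as $m) {
    // Group 2 is present only for unquoted tokens; otherwise group 1
    // holds the (still escaped) content of a quoted token.
    $tokens[] = $m[2] ?? $m[1];
}

foreach ($tokens as $token) {
    echo $token, "~\n";   // ~ marks the end of each token, as above
}
```

This yields 9 tokens. The two tokens that contain escapes come out as \'escaped quote\' and  \' \\  until they are unescaped.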


The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:

\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))

Since the alternatives are tried in order from left to right, the last alternative ((?s).+$) will match only if the text ahead does not form a valid single-quoted or unquoted token. This can be used to check for errors.

The first capturing group will contain the text inside a single-quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
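
A sketch of how the third group could be used for error detection in PHP (the invalid input and variable names are my own assumptions; a bare backslash is not allowed outside quotes, so tokenization should stop there):

```php
<?php
$pattern = <<<'REGEX'
/\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))/
REGEX;

// Hypothetical invalid input: the backslash after "bad" is not a valid token start.
$input = <<<'TXT'
good 'ok' bad\slash rest
TXT;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$tokens = [];
$error  = null;
foreach ($matches as $m) {
    if (isset($m[3])) {
        // Third group matched: everything from here on is unparsable.
        $error = $m[3];
        break;
    }
    $tokens[] = $m[2] ?? $m[1];   // unquoted token, else quoted content
}

print_r($tokens);   // good, ok, bad
var_dump($error);   // string(11) "\slash rest"
```

Because of \G, the valid tokens before the error are still returned, and the remainder tells you exactly where the input stopped being valid.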

Demo for the final regex

Conclusion

The example above demonstrates one scenario of using \G in tokenization. There may be other uses that I haven't come across.

nhahtdh