7

In the pypi page of the awesome regex module (https://pypi.python.org/pypi/regex) it is stated that \G can be used "in negative variable-length lookbehinds to limit how far back the lookbehind goes". Very interesting, but the page doesn't give any example and my white-belt regex-fu simply chokes when I try to imagine one.

Could anyone describe some sample use case?

Ron
  • 7,588
  • 11
  • 38
  • 42
  • 2
    I think it is meant to refer to `(?<!\G\w*)`, or something like this. – nhahtdh Dec 19 '14 at 09:12
  • OK, ninja edited, but it is somewhat the same concept... I can't think of any use case for this, though. – nhahtdh Dec 19 '14 at 09:14
  • I'm still wondering why the doc specifically says "negative lookbehind" though, rather than simply lookbehinds – Ron Dec 19 '14 at 09:18
  • 1
    `(?<!...)` negative lookbehind `(?<=..)` positive lookbehind. If there is a quantifier `?` or `*` or `+` present inside the lookehinds, then it's called variable length positive or negative lookbehind. `(?<!\G\w*)` called variable length negative lookbehind, because it has a quantifier `*` – Avinash Raj Dec 19 '14 at 09:20
  • It probably considers the complexity needed to get a "success" from a variable-length look-behind. For positive look-behind, it only need to go back as many characters for the pattern to match, while for negative look-behind, it needs to check to the beginning of the string. However, it baffles me, since the reverse would happen when we consider the failing case for the look-behind. – nhahtdh Dec 19 '14 at 09:23
  • 1
    @sln to be honest, I was quite busy the last few days so I couldn't spend time on thinking of a solution/example to this question. Since the question "didn't reveive enough attention", I've put a bounty on it. – HamZa Dec 29 '14 at 12:04

3 Answers3

6

Here's an example that uses \G and a negative lookbehind creatively:

regex.match(r'\b\w+\b(?:\s(\w+\b)(?<!\G.*\b\1\b.*\b\1\b))*', words)

words should be a string of alphanumeric characters separated by a single whitespace, for example "a b c d e a b b c d".

The pattern will match a sequence of unique words.

  • \w+ - Match the first word.
  • (?:\s(\w+\b) )* - match additional words ...
  • (?<!\G.*\b\1\b.*\b\1\b) - ... but for each new word added, check it didn't already appear until we get to \G.

A lookbehind at the end of the pattern that is limited at \G can assert another condition on the current match, which would not have been possible otherwise. Basically, the pattern is a variation on using lookaheads for AND logic in regular expressions, but is not limited to the whole string.

Here's a working example in .Net, which shares the same features.
Trying the same pattern in Python 2 with findall and the regex module gives me a segmentation fault, but match seems to work.

Community
  • 1
  • 1
Kobi
  • 135,331
  • 41
  • 252
  • 292
  • 3
    Doesn't crash for me (Python 3.4.1) but `findall` returns `['', '', '', '', '', '']` which doesn't seem useful. `match` **does** return `` which is neat!-) – Alex Martelli Dec 27 '14 at 18:20
  • I think the lookbehind using `\G` position works when its the first sub-expression. I don't see it as viable being the last one (on the right). –  Dec 28 '14 at 22:23
  • Yeah, you're right it only checks up to the current match position. I was confused a bit, thought it was open ended –  Dec 30 '14 at 19:26
3

One example I could think of is using \G in a positive lookbehind to split a CSV row by commas:

regex.split(r'(?<=\G(?:"[^"]*(?:""[^"]*)*"|[^"]*)),', csv)

This is a variant you can only do with variable length lookbehind and \G.

Usually if you want to use split you add a lookahead until the end of the row, and check that the following records are all valid, like here ,(?=([^\"]*\"[^\"]*\")*[^\"]*$). This is somewhat annoying because you keep matching the end on the string over and over.

We also never explicitly mention that the unquoted values are not commas ([^,]) because we use \G to match since the previous comma.

Community
  • 1
  • 1
Kobi
  • 135,331
  • 41
  • 252
  • 292
  • 3
    Just a note. The split regex just validates the next comma is outside of dbl-quotes. I doesn't validate any evenness of the remaining quotes. The second regex (and link) is a really bad way of validation (as you say) since its like a N-factorial overhead. If validation is required, it could be done once per line, then use a normal field regex on it, avoid split altogether. Was benchmarked and discussed -> [here](http://stackoverflow.com/questions/27450102/how-to-convert-comma-delimited-file-to-pipe-delimited-in-vb-net/27451004#27451004). –  Dec 28 '14 at 22:37
  • 2
    `[^"]*(?:""[^"]*)*` can be written as `(?:[^"]|"")*`. And depending on the engine, it can be made possessive. – nhahtdh Dec 29 '14 at 07:27
  • 1
    @nhahtdh - You're right. I went with [unrolling-the-loop](http://stackoverflow.com/a/5455705/7586) without thinking about it too much. `(?:[^"]|"")*+` can work nicely here. – Kobi Dec 29 '14 at 07:33
2

This is an example that might be more useful than creative.
It uses a variable length negative lookbehind starting from the last match (\G)
checking forward to the current position.

In this case (?<!\Ga+)b, the result is that it matches every other b in a string where b's are separated by one or more a's.

This can also be done in a fixed width lookbehind like (?<!\Ga)b where the result is that it matches every other b in a string where b's are separated by one a.

This is kind of a template where a and b could be bigger expressions and have a little more
meaning.

(One thing to be aware of is that when using \G in a negative lookbehind, it is fairly easy to
satisfy the negative assertion. So, these kind of things have gotcha written all over it !!
)

Don't have Python (latest, beta?) to test this with, so below is using C# console app.

string strSrc = "abaabaabaabaabaab";
Regex rxGtest = new Regex(@"(?<!\Ga+)b");
Match _m = rxGtest.Match(strSrc);
while (_m.Success)
{
    Console.WriteLine("Found: {0} at position {1}", _m.Groups[0].Value, _m.Index);
    _m = _m.NextMatch();
}

Output:

Found: b at position 4
Found: b at position 10
Found: b at position 16