Note:

  • The observed behavior is correct, but may at first be surprising; it was to me, and I think it may be to others as well - though probably not to those intimately familiar with regex engines.

  • The repeatedly suggested duplicate, Regex lookahead, lookbehind and atomic groups, contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.


A greedy (and therefore, by definition, variable-width) subexpression inside a positive look-behind assertion can exhibit surprising behavior.

The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general.

This command works as I intuitively expect:

```powershell
# OK:
#     The subexpression matches greedily from the start up to and
#     including the last "_", and, by including the matched string ($&)
#     in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c
```

The following command, which uses a positive look-behind assertion, (?<=...), is seemingly equivalent - but isn't:

```powershell
# CORRECT, but SURPRISING:
#   Use a positive lookbehind assertion to *seemingly* match
#   only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c  # !! *multiple* insertions were performed
```

Why isn't it equivalent? Why were multiple insertions performed?

mklement0
  • 2
    **This is a duplicate of [Regex lookahead, lookbehind and atomic groups](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) and also [this thread](https://stackoverflow.com/a/11197672/3832970), too.** – Wiktor Stribiżew Mar 03 '21 at 15:39
  • 6
    @WiktorStribiżew Reposting the same information without addressing the points made to the contrary is unhelpful. The comments on the answer below explain why your links don't help. That the additional link you've posted this time doesn't help either is easily verified by looking for the terms "variable-width", "variable-length", and "greedy" - which _this_ question is about - there: you won't find them. – mklement0 Mar 03 '21 at 16:06
  • 5
    Just to spare future readers potentially wasted effort, and to balance the shouting in the comment above: **The alleged duplicate(s) _generically_ describe the behavior of look-around assertions. They do _not_ address this question's _specific misconception_**. While you can hypothetically _infer_ the explanation from the linked posts, such an inference is far from obvious. **Only the answer below provides a specific, (hopefully) clear explanation**. – mklement0 Apr 05 '21 at 21:40
  • 1
    Does this answer your question? [Regex lookahead, lookbehind and atomic groups](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) – Kevin B Aug 04 '21 at 15:04

1 Answer

tl;dr:

  • Inside a look-behind assertion, a greedy subexpression in effect also behaves non-greedily in global matching: the assertion is tested at every character position, so every prefix of the input string that matches the subexpression produces a match, not just the longest one.

My problem was that I hadn't considered that a look-behind assertion must be tested at each and every character position in the input string: at each position, the engine checks whether the text preceding that position matches the subexpression inside the assertion.
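
To make that position-by-position check concrete, here is a minimal sketch that emulates what the assertion has to verify. It is not a description of how the .NET engine works internally, and the variable names ($text, $positions) are made up for illustration. For every character position, it tests whether the whole text to the left matches the subexpression, with \z pinning the end of the match to that position:

```powershell
$text = 'a_b_c'

# Visit every position, from before the first character (0)
# to after the last one ($text.Length).
$positions = foreach ($i in 0..$text.Length) {
  # The text to the left of position $i.
  $prefix = $text.Substring(0, $i)
  # The look-behind succeeds here if that prefix, as a whole, matches ^.+_
  # (\z pins the end of the match to the end of the prefix).
  if ($prefix -match '^.+_\z') { $i }
}

$positions -join ', '   # -> 2, 4
```

The assertion holds at two positions, which is why a global replacement finds two places to insert at.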

This, combined with the always-global replacement that PowerShell's -replace operator performs (that is, every match found is replaced), resulted in multiple insertions.

That is, the greedy, anchored subexpression ^.+_ legitimately matched twice, each time against the text to the left of the character position being tested:

  • First, when a_ was the text to the left.
  • And again when a_b_ was the text to the left.

Therefore, two insertions of | resulted.
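
The two matches can be made visible by calling the .NET Regex class directly. This is only a small sketch, and $text and $matchInfo are illustrative names. Each match is zero-width, since a look-behind assertion consumes no characters, so only its index is of interest:

```powershell
$text = 'a_b_c'

# Enumerate all (zero-width) matches of the look-behind assertion.
$matchInfo = foreach ($m in [regex]::Matches($text, '(?<=^.+_)')) {
  'index {0}, text to the left: {1}' -f $m.Index, $text.Substring(0, $m.Index)
}
$matchInfo
# index 2, text to the left: a_
# index 4, text to the left: a_b_

# -replace substitutes at every match; the Regex.Replace() overload with a
# count argument limits the number of replacements instead, here to one.
[regex]::new('(?<=^.+_)').Replace($text, '|', 1)   # -> a_|b_c
```

Limiting the count keeps only the first insertion (after a_), which shows that the extra insertion comes from global matching rather than from the replacement string.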


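By contrast, without a look-behind assertion, the greedy expression ^.+_ by definition matches only once, through to the last _, because it is applied only to the entire input string: the match must start at the very beginning and consumes everything up to and including the last _, which leaves no room for a second match.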
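
For comparison, the same enumeration without the look-behind, again as a sketch with an illustrative variable name ($plainMatches):

```powershell
$text = 'a_b_c'

# Without the look-behind, the pattern consumes the characters it matches.
$plainMatches = [regex]::Matches($text, '^.+_')

$plainMatches.Count      # -> 1
$plainMatches[0].Index   # -> 0
$plainMatches[0].Value   # -> a_b_
```

After this single match, the search resumes at index 4, where ^ can no longer match, so no second match is found.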

mklement0
  • 2
    This is a [known lookaround behavior](https://stackoverflow.com/questions/11197608/fixed-length-regex-required/11197672#11197672). No need to repeat it. This is also part of the [Regex FAQ](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). – Wiktor Stribiżew Mar 02 '21 at 21:19
  • 3
    @WiktorStribiżew re your links: the FAQ (undoubtedly a good general resource) merely links to the first linked answer, which is about _non-support for variable-width patterns_ in lookbehind assertions in Python. By contrast, _this_ question is precisely about _variable-width_ patterns, and it is precisely that variable-width nature that gave rise to the misconception exhibited in my question. The only reference in the FAQ to variable-width lookbehinds is [this answer](https://stackoverflow.com/a/20994257/45375), which merely states that .NET does support them in general. – mklement0 Mar 03 '21 at 14:13
  • 5
    In short: The behavior _isn't_ described in the links you've posted - not in any meaningful way that would clear up the specific misconception at hand. This answer now does describe it, and it will hopefully clear up the misconception for others too. – mklement0 Mar 03 '21 at 14:13