I have the following string: "text before AB000CD000CD text after". I want to match text from AB to the first occurrence of CD. Inspired by this answer, I created the following regex pattern:

AB((?!CD).)*CD

I checked the result in https://regex101.com/ and the output is:

Full match  12-19   `AB000CD`
Group 1.    16-17   `0`

Looks like it does what I need. However I don't understand why it works. My understanding is that my pattern should match AB first, then any character that is not followed by CD, and then CD itself. But following this logic, the result should not include 000, but only 00 because the last zero is actually followed by CD. Is my explanation wrong?

username
  • You were close: it should be `AB((?:(?!CD).)*)CD`. The capture group should _enclose_ the quantified non-capturing group `(?:(?!CD).)*`, which lets it capture the entire content between AB and CD. –  Jul 11 '17 at 23:59
  • `the result should not include 000, but only 00 because the last zero is actually followed by CD. Is my explanation wrong` _Yes, that is wrong_. Say the current position is _here->_ `0CD`. The next two characters are `0C`, and `0C != CD`, so the `0` is consumed. Then the position is _here->_ `CD`. Since `CD == CD`, the lookahead fails and the repetition stops; the engine then moves on to the next part of the regex, where it matches `CD`. –  Jul 12 '17 at 00:07

1 Answer

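`AB((?!CD).)*CD` matches `AB`, then any char that does not start a `CD` char sequence (zero or more times), and then `CD`. That is where you are wrong in saying "that is not followed by CD": the negative lookahead is located before the `.`, so it is tested at the position of the character about to be consumed, not after it.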
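To make the order of operations concrete, here is a minimal Python sketch that reproduces the match from the question and then replays the lookahead check by hand. The manual loop is only an illustration of the logic, not how the `re` engine is implemented, and the variable names are chosen for this example.

```python
import re

text = "text before AB000CD000CD text after"
pattern = re.compile(r"AB((?!CD).)*CD")

m = pattern.search(text)
print(m.span(), m.group(0))    # full match: offsets 12-19, 'AB000CD'
print(m.span(1), m.group(1))   # group 1 keeps only its last iteration: 16-17, '0'

# Replay the tempered dot manually: at each position the lookahead
# inspects the text starting AT that position, before "." consumes it.
pos = m.start() + len("AB")
while not text.startswith("CD", pos):
    print(f"pos {pos}: next chars {text[pos:pos + 2]!r} != 'CD', consume {text[pos]!r}")
    pos += 1
print(f"pos {pos}: next chars are 'CD', lookahead fails, loop stops, CD is matched")
```

All three zeros are consumed because none of them is the start of `CD`; the repeated group then reports only the last character it captured.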

Besides, it makes no sense to use the tempered greedy token when the negated part is the same as the trailing boundary; just use a lazy dot matching pattern, `AB(.*?)CD`. You need the construct when you do not want to match `AB` (the initial boundary) between the `AB` and `CD`, i.e. `AB((?:(?!AB).)*?)CD` (it is the most common use case).
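A short Python sketch of both points, assuming a made-up second input string (`nested`) that contains a repeated `AB`: on the question's string the lazy dot and the tempered token capture the same text, while on the repeated-`AB` input only the tempered version skips to the innermost pair.

```python
import re

text = "text before AB000CD000CD text after"

# Same trailing boundary in the lookahead: both patterns capture "000".
lazy_same = re.search(r"AB(.*?)CD", text)
tempered_same = re.search(r"AB((?:(?!CD).)*)CD", text)
print(lazy_same.group(1), tempered_same.group(1))

# Hypothetical input where the initial boundary AB appears twice.
nested = "AB 1 AB 2 CD"

# The lazy dot starts at the first AB and happily runs over the second one.
lazy_nested = re.search(r"AB(.*?)CD", nested)
print(repr(lazy_nested.group(1)))

# Tempering the dot with (?!AB) forbids crossing another AB, so the
# attempt from the first AB fails and the match starts at the second AB.
tempered_nested = re.search(r"AB((?:(?!AB).)*?)CD", nested)
print(repr(tempered_nested.group(1)))
```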

See rexegg.com reference about when to use it:

Suppose our boss now tells us that we still want to match up to and including {END}, but that we also need to avoid stepping over a {MID} section, if it exists. Starting with the lazy dot-star version to ensure we match up to the {END} delimiter, we can then temper the dot to ensure it doesn't roll over {MID}:

{START}(?:(?!{MID}).)*?{END}

If more phrases must be avoided, we just add them to our tempered dot:

{START}(?:(?!{MID})(?!{RESTART}).)*?{END}
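The quoted patterns can be tried out in Python as sketched below. The sample strings are invented for illustration, and the delimiters are passed through `re.escape` as a precaution so the braces are always treated as literal text.

```python
import re

# Escape the literal delimiters so "{" and "}" cannot be read as quantifier syntax.
START, MID, END, RESTART = (
    re.escape(s) for s in ("{START}", "{MID}", "{END}", "{RESTART}")
)

lazy = re.compile(f"{START}.*?{END}")
no_mid = re.compile(f"{START}(?:(?!{MID}).)*?{END}")
no_mid_restart = re.compile(f"{START}(?:(?!{MID})(?!{RESTART}).)*?{END}")

# Made-up sample inputs.
plain = "{START} a {END}"
with_mid = "{START} a {MID} b {END}"
with_restart = "{START} a {RESTART} b {END}"

print(bool(lazy.search(with_mid)))                 # lazy dot rolls over {MID}
print(bool(no_mid.search(plain)))                  # nothing to avoid, still matches
print(bool(no_mid.search(with_mid)))               # tempered dot refuses to cross {MID}
print(bool(no_mid.search(with_restart)))           # {RESTART} is not excluded yet
print(bool(no_mid_restart.search(with_restart)))   # now {RESTART} blocks the match too
```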

Also, see this thread.

Community
Wiktor Stribiżew
  • There are actually reasons to prefer `((?!CD).)*CD` over `(.*?)CD`. For example, `((?!CD).)*CD` will never match a `CD` in the `((?!CD).)*` part, while `(.*?)CD` could do so if the search backtracks. That won't matter when there's nothing in the regex after `CD`, as is the case here, but it can matter for other regexes. – user2357112 Jul 11 '17 at 23:41
  • @user2357112 Perhaps, I should add another quote from rexegg.com, **When Not to Use this Technique**: *"For the task at hand, this technique presents no advantage over the lazy dot-star `.*?{END}`. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is `{END}`."* The regex "at hand" was `{START}(?:(?!{END}).)*{END}`, analogous to OP regex. – Wiktor Stribiżew Jul 11 '17 at 23:45
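The difference raised in the comments only shows up when something follows `CD` in the pattern. A minimal Python sketch, assuming an invented input and an extra trailing `x` in both patterns:

```python
import re

# Hypothetical input: the first CD is followed by "2", the second by "x".
sample = "AB1CD2CDx"

# Lazy dot: after the first CD fails to be followed by "x", the engine
# backtracks and lets .*? grow past that CD.
lazy = re.search(r"AB(.*?)CDx", sample)
print(lazy.group(1))   # the captured text contains a CD

# Tempered dot: the repetition can never consume the start of a CD,
# so once the first CD is not followed by "x" the whole match fails.
tempered = re.search(r"AB((?:(?!CD).)*)CDx", sample)
print(tempered)
```

With nothing after `CD`, as in the question's regex, the two approaches find the same text, which is the situation the rexegg quote describes.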