3

I noticed that there are 3 different classes of quantifiers: greedy, lazy (i.e. non-greedy) and possessive.

I know that, loosely speaking, greedy quantifiers try to get the longest match by first reading in the entire input string and then truncate the characters one by one if the attempts keep failing; lazy quantifiers try to get the shortest match by first reading in the empty string and then add in the characters one by one if the attempts keep failing; possessive quantifiers try the same way as greedy quantifiers while they will stop matching if the first attempt fails.

However, I'm not sure how exactly the aboves are being implemented 'internally', and would like to ask for clarification (hopefully with examples).


For example, say we have the input string as "fooaaafoooobbbfoo".

If the regex is "foo.*" (greedy), will the foo in the regex first match the foo in the input string, and then .* reads in aaafoooobbbfoo as 'the entire string'? Or will .* first read in fooaaafoooobbbfoo as 'the entire string', and then truncates fooaaafoooobbbfoo to try matching the foo in the regex? If it is the latter, will fooaaafoooobbbfoo be truncated from its left or from its right in each attempt?

Will the answers to the above questions change if I replace "foo.*" with ".*foo" or "foo.*foo" as my regex? What about if I change those greedy quantifiers to lazy ones and possessive ones?

And if there are more than one quantifiers in a regex, how will the engine deal with the priority (if that matters)?


Thanks in advance!

J-A-S
  • 368
  • 1
  • 8
  • 3
    Tip: use the [Regex Debugger](https://regex101.com/r/BlgaSY/1/debugger) to view the steps live. – 41686d6564 stands w. Palestine Oct 27 '20 at 08:25
  • This is [related answer](https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-qualifiers) but this one is asking about internal working of greedy / lazy (non-greedy) / possessive quantifiers. – anubhava Oct 05 '22 at 11:40

1 Answers1

4

For your input string fooaaafoooobbbfoo.

Case 1: When you're using this regex:

foo.*

First remember this fact that engine traverses from left to right.

With that in mind above regex will match first foo which is at the start of input and then .* will greedily match longest possible match which is rest of the text after foo till end. At this point matching stops as there is nothing to match after .* in your pattern.

Case 2: When you're using this regex:

.*foo

Here again .* will greedily match longest possible match before matching last foo which is right the end of input.

Case 3: When you're using this regex:

foo.*foo

Which will match first foo found in input i.e. foo at the start then .* will greedily match longest possible match before matching last foo which is right the end of input.

Case 4: When you're using this regex with lazy quantifier:

foo.*?foo

Which will match first foo found in input i.e. foo at the start then .*? will lazily match shortest possible match before matching next foo which is second instance of foo starting at position 6 in input.

Case 5: When you're using this regex with possessive quantifier:

foo.*+foo

Which will match first foo found in input i.e. foo at the start then .*+ is using possessive quantifier which means match as many times as possible, without giving back. This will match greedily longest possible match till end and since possessive quantifier doesn't allow engine to backtrack hence presence of foo at the end of part will cause failure as engine will fail to match last foo.

anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 1
    Thank you very much for your answers! May I ask for further clarification on the details of this process: Does this mean, say for `"foo.*foo"`, the regex will first find the 1st occurrence of `foo` in the input (which is at the start), then, `.*` will read in `aaafoooobbbfoo`, next, `.*` will start to truncate `aaafoooobbbfoo` one char by one char until what has been truncated matches with the last part of the regex (which is `foo`)? If so, may I ask how exactly does this truncating go? Does it truncate `aaafoooobbbfoo` from left to right or from right to left? – J-A-S Oct 27 '20 at 08:10
  • Truncating or **backtracking** is one character at a time. So engine backtracks one position and reattempts matching `foo` and repeats this process until a match is success or failure. In this case match is success as soon as last `foo` is matched. – anubhava Oct 27 '20 at 08:23
  • 1
    So may I confirm with you if the engine backtracks **from right to left**? (so in this case, `aaafoooobbbfoo` becomes `aaafoooobbbfo` and then `aaafoooobbbf` and then `aaafoooobbb` and then the match is done) Thank you for your patience :) – J-A-S Oct 27 '20 at 08:36
  • @J-A-S: Yes you are spot on. Engine moves back1 position at a time attempting to match `foo` at positions `aaafoooobbbfoo, aaafoooobbbfo, aaafoooobbbf, aaafoooobbb`. This is when it succeeds in matching `foo`. – anubhava Oct 27 '20 at 09:13
  • 1
    Also I tried the regex `".*foo.*foo"` against the input, and just found out that, at the first encounter of the second `foo` in the regex, in the input the engine will move back to the index where one char before the last `foo` and backtrack for the first `foo` of the regex from there. This allows the second `foo` in the regex to match the last `foo` in the input. Anyway I think this worth mentioning :) – J-A-S Oct 27 '20 at 09:29