why doesn't it match the whole text?
Oh, it tried. But it found the sequence p
, r
, o
, j
, 1
in one spot, then went on to find zero or more characters not being a comma, therefore matching .
, p
, d
, f
. And it stopped there, since the next character, being a comma, did not match [^,]
.
Note that the next matching attempt will start at the next character, ie r
, and so on until it finds a p
; when it finds one, it will try r
, etc etc.
The regex being satisfied in its entirety, the engine decided that it was a success and didn't try any further, even though there are matches beyond that point.
The text matched is therefore proj1.pdf
. And not the entire input. Regular expressions are lazy, they only match what they need to match and never go any further.
BUT. And this is where it gets interesting. Some engines don't work this way.
Consider the regex cat(|flap)
and the input text catflap
. POSIX has had a go at regex engines, and dictated that a regex engine should match the leftmost, longest match.
So, if a regex engine obeys POSIX, it should match catflap
. But most regex engines in existence, here, will only match cat
: the empty alternation matches first, the regex is satisfied, end of story!
Now, to the core of the question: quantifiers are of three types, greedy, lazy and possessive:
- greedy: the default,
*
;
- lazy, aka the overused:
*?
;
- possessive:
*+
.
A greedy quantifier will try and match as much text as it can, and give back only when it has to; a lazy quantifier will try and match as little text as it can; a possessive quantifier will try and match as much text as it can, and it will not give text back.
Illustration: here is the input text:
The answer to everything is 42, says the mouse
Here are three regexes to match this text, with a capturing group:
.*(\d+)
(greedy);
.*?(\d+)
(lazy);
.*+(\d+)
(possessive).
Question: what will the group capture in each of these expressions? Answer:
- the first:
2
;
- the second:
42
;
- the third: will not even match the text!
.*+
will have swallowed everything but will not give back, \d+
is therefore left with nothing which can match, regex failure.