`?` means many different things in a regex, depending on context.
When used as a quantifier, it means "optional". For example, `[0-9]?` matches 0 or 1 digits (an optional digit).
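As a quick check of the "optional" meaning, here is a minimal sketch using Python's `re` module (the choice of Python and the variable names are my own, not part of the question). `fullmatch` is used so that the whole string has to fit the pattern:

```python
import re

optional_digit = re.compile(r"[0-9]?")

# fullmatch succeeds only if the entire string matches the pattern.
matches_empty = optional_digit.fullmatch("") is not None    # zero digits
matches_one = optional_digit.fullmatch("7") is not None     # one digit
matches_two = optional_digit.fullmatch("42") is not None    # two digits: one too many

print(matches_empty, matches_one, matches_two)  # True True False
```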
However, when applied to another quantifier, it means "non-greedy". Normal quantifiers try to match as many times as possible (giving up matches only if the rest of the regex would fail otherwise). Non-greedy matching inverts this: It first tries to match as few times as possible and only matches more if the rest of the regex would fail otherwise.
For example, `a[a-z]*b` applied to `abcdebafcbad` matches `abcdebafcb` (or `(abcdebafcb)ad`, using parentheses to mark the match). `[a-z]*` will first consume the rest of the string, then give back characters one at a time until the following `b` can match.
However, `a[a-z]*?b` applied to `abcdebafcbad` matches `ab` (or `(ab)cdebafcbad`, using parentheses to mark the match). `[a-z]*?` starts out by matching no characters at all, and because `b` is able to match immediately, that's where the regex stops.
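The two cases above can be reproduced side by side. This is a small sketch in Python (my choice of language; the variable names are illustrative only):

```python
import re

text = "abcdebafcbad"

# Greedy: [a-z]* grabs as much as it can, then backs off to the last usable b.
greedy = re.search(r"a[a-z]*b", text).group()

# Non-greedy: [a-z]*? grabs nothing at first, and b matches right away.
lazy = re.search(r"a[a-z]*?b", text).group()

print(greedy)  # abcdebafcb
print(lazy)    # ab
```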
As for your examples:
The first thing you need to understand about regexes is the outer loop. There is a loop that tries to invoke the regex engine at every position in the input string, from left to right. If a match is found, the loop stops and reports success. If all positions are exhausted without finding a match, the loop stops and reports failure.
(Strictly speaking, we're not looping over the characters in the input string, but over the gaps between characters. `ab` has 3 possible match positions: before `a`, between `a` and `b`, and after `b`.)
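The outer loop described above can be sketched roughly as follows. This is only an illustration of the idea, not how any real engine is implemented; `search_with_outer_loop` is a hypothetical helper name, and it relies on the fact that a compiled Python pattern's `match(text, pos)` only tries to match starting exactly at `pos`:

```python
import re

def search_with_outer_loop(pattern, text):
    """Try the regex at every gap in text, left to right (illustrative only)."""
    compiled = re.compile(pattern)
    # len(text) + 1 positions: before each character, plus the very end.
    for start in range(len(text) + 1):
        m = compiled.match(text, start)  # anchored attempt at this position
        if m is not None:
            return m                     # first success wins; the loop stops
    return None                          # all positions exhausted

positions = list(range(len("ab") + 1))
found = search_with_outer_loop(r"[0-9]+", "page 26")
missing = search_with_outer_loop(r"[0-9]+", "page")

print(positions)     # [0, 1, 2]
print(found.span())  # (5, 7)
print(missing)       # None
```

For this pattern the result agrees with what `re.search` reports, which is the point: `search` already performs this scan for you.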
Your first regex, `apples??`, is equivalent to `apple(?:|s)`. We match `apple` followed by either an empty string or `s` (i.e. we try to match nothing first). Since the empty regex always matches and this is the end of the regex, there's nothing that could come later on and force us to revisit this decision. The regex is equivalent to just `apple`.
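To see this equivalence in practice, here is a short Python sketch (the input string `apples` is my own test input). The greedy `apples?` is included only for contrast:

```python
import re

text = "apples"

lazy_optional = re.search(r"apples??", text).group()     # prefers to skip the s
alternation = re.search(r"apple(?:|s)", text).group()    # empty branch tried first
plain = re.search(r"apple", text).group()
greedy_optional = re.search(r"apples?", text).group()    # prefers to take the s

print(lazy_optional, alternation, plain, greedy_optional)
# apple apple apple apples
```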
Your second regex, `.*?[0-9]*`, is a bit curious. The first thing you should notice is that all parts of it are optional, i.e. it can match a string of length 0. Because an empty string can be matched anywhere in the input and the outer loop mentioned above starts at offset 0, this regex will always match at the beginning of the input string.
The second thing to notice is that it attempts to reimplement the outer loop within the regex: `.*?` will consume 0 characters at first, then (if the rest of the regex doesn't match) 1 character, then 2, ... until a match is found. This is a bit pointless because the outer loop already does exactly that.
If the input string is `page 26`, `.*?` will start out by matching as few characters as possible, i.e. none. Then `[0-9]*` will try to match as many digits as possible, but `p` is not a digit, so "as many as possible" is also none. Thus `.*?[0-9]*` matches the empty string at the beginning of `page 26`: `()page 26` (using parentheses to mark the match).
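The empty match is easy to confirm. A minimal Python sketch (variable names are mine):

```python
import re

m = re.search(r".*?[0-9]*", "page 26")

matched_text = m.group()  # what was matched
matched_span = m.span()   # (start, end) offsets of the match

print(repr(matched_text), matched_span)  # '' (0, 0)
```

The match succeeds, but it is zero characters long and sits at offset 0, so the digits further along are never reached.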
Your third regex, `.*?[0-9]+`, still contains that redundant explicit loop at the beginning, `.*?`. But now the digits part is not optional: `[0-9]+` requires at least one digit to match.
If the input string is `page 26`, `.*?` will start out by matching as few characters as possible, i.e. none. Then `[0-9]+` will try to match as many digits as possible, but at least one. This fails because `p` is not a digit. Because `[0-9]+` failed, we backtrack into `.*?` and try to consume one more character (`p`). Then we try `[0-9]+` against the remaining input string, `age 26`. This also fails; we backtrack and consume one more character in `.*?` (`pa`). Then we try `[0-9]+` against `ge 26`, which still fails. ...
This continues until `.*?` has consumed `page ` (the word plus the following space). At this point `[0-9]+` finally finds a digit to match, `2`. Because `+` is greedy, we consume all available digits at this position, `26`. The final match is `(page 26)` (i.e. the whole input string) with `.*?` matching `page ` (including the space) and `[0-9]+` matching `26`.
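The backtracking steps above can be imitated by hand. The sketch below is a simplification that only models this particular regex (a lazy prefix followed by `[0-9]+`); `trace_lazy_prefix` is a hypothetical helper of my own, not something a regex engine exposes. Capture groups are then used to confirm what each part of the real regex ends up matching:

```python
import re

def trace_lazy_prefix(text):
    """Imitate .*? growing one character at a time until [0-9]+ can match."""
    digits = re.compile(r"[0-9]+")
    attempts = []
    for consumed in range(len(text) + 1):
        m = digits.match(text, consumed)  # try the digits right after the prefix
        attempts.append((text[:consumed], m.group() if m else None))
        if m:
            break                         # first success ends the search
    return attempts

attempts = trace_lazy_prefix("page 26")
for prefix, digits_found in attempts:
    print(repr(prefix), "->", digits_found)

# The real engine, with groups added to show which part matched what.
groups = re.search(r"(.*?)([0-9]+)", "page 26").groups()
print(groups)  # ('page ', '26')
```

The trace lists six attempts: the prefix grows from the empty string to `page ` (with the space), and only the last attempt finds digits.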