3

How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:

'cat'
'' (ie, nothing written, just the empty string)

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
Tony Stark
  • 24,588
  • 41
  • 96
  • 113

5 Answers5

10

The "+?" is not a "+" quantifier followed by a "?" quantifier. Instead the "?" modifies the "+" to perform a "lazy" or "non greedy" match, meaning that the least number of characters that match is already sufficient.

So a "a+?" regex would match just a single "a" in "caaat".

Hans Kesting
  • 38,117
  • 9
  • 79
  • 111
8

Besides what Hans Kesting already said, a lazy multiplier will do the exact oposite of the normal greedy multipliers: The possible match is kept as small as possible and the rest of the regular expression is tested.

So if you’re having the string aaba and test the regular expression a.*b on it, the internal processing steps would be as follows:

  1. a in a.*b matches aaba
  2. .* in a.*b matches aaba, and since .* is greedy
    1. .* then matches aaba
    2. .* then matches aaba
  3. b in a.*b fails as there is no letter left
    • backtracking goes one step back and .* will now only match bb in aaba
  4. b in a.*b still fails on aaba
    • backtracking goes one step back and .* now matches only b in aaba
  5. b in a.*b now matches b in aaba and we’re done.

So the full match is aaba.

If we do the same with a lazy multiplier (a.*?b), the processing will do the oposite, try to match the least possible characters as possible:

  1. a in a.*?b matches aaba
  2. .* in a.*?b matches nothing (* = zero or more repetitions), and since .* is declared as lazy (.*?), the rest of the regular expression is tested
  3. b in a.*?b fails on aaba
    • backtracking will try to increase the match of .*
  4. .* matches now aaba
  5. b in a.*?b matches aaba and we’re done.

So the full match if aaba.

Gumbo
  • 643,351
  • 109
  • 780
  • 844
  • So a.*b would catch BOTH aab and ab, correct? But a.*?b only catches aab? (using your aaba example for both scenarios). So .*? isn't necessarily trying to find the LEAST possibly characters, but the first instance of '.' being 1 letter, and if it matches, done. So once a.*?b catches aab, it wouldn't go on to try catching just ab. – Tony Stark Aug 17 '09 at 12:53
  • @hatorade: No, sorry. Both expressions match only **`aab`**`a`. – Gumbo Aug 17 '09 at 12:58
  • @Gumbo: I don't see why .*? wouldn't catch ab. a.*?b clearly matches ab. it makes sense that .* catches aab because it is greedy and tries to get the max fit. but isn't that the point of lazy plus? – Tony Stark Aug 17 '09 at 13:06
  • @hatorade: The match is performed on the whole string and not just a substring. So **`aab`**`a` is the only match for both regular expressions. – Gumbo Aug 17 '09 at 13:15
  • @Gumbo: well, apparently it does work as you say according to python. guess i just need to wrap my head around it. – Tony Stark Aug 17 '09 at 13:18
4

+? (lazy plus)

Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.

/".+?"/ matches "def" (and "ghi") in abc "def" "ghi" jkl, while /".+"/ matches "def" "ghi".

You can find more info here

Svish
  • 152,914
  • 173
  • 462
  • 620
rahul
  • 184,426
  • 49
  • 232
  • 263
  • 2
    The quotes in the original example confused me at first -- I thought they were just setting the pattern off from the text, not part of the pattern. I think adding slashes clarifies the intent. – tvanfosson Aug 17 '09 at 12:05
  • as Svish apparently edited, /".+?"/ would match both "def" and "ghi" right? Separately. So it would return two expressions. Whereas /".+"/ catches the entire phrase "def" "ghi" counting def" "ghi as the .+ part? – Tony Stark Aug 17 '09 at 12:56
  • @hatorade, as far as I could see testing it in RegexBuddy, yes. – Svish Aug 17 '09 at 13:05
  • @Svish i tested in python and .+? seems to just return "def" and didn't mention catching "ghi" but that's probably because i need something else in my regex to indicate i want all the matches of the .+? – Tony Stark Aug 17 '09 at 13:12
1

There is documentation on how Perl handles these quantifiers perldoc perlre.

By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
    *?     Match 0 or more times, not greedily
    +?     Match 1 or more times, not greedily
    ??     Match 0 or 1 time, not greedily
    {n}?   Match exactly n times, not greedily
    {n,}?  Match at least n times, not greedily
    {n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back
For instance,
   'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
   /"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
   /"(?>(?:(?>[^"\\]+)|\\.)*)"/

link

Community
  • 1
  • 1
Brad Gilbert
  • 33,846
  • 11
  • 78
  • 129
0

inevitably, the regex is going to look for at least one character. I've come across case where an empty string wouldn't pass that test already, it would be better to use .*? or (.*)? instead, sometimes you have to specify the part of the string which may be null in braces before the question mark, it helps. E.g. \d{6}? will yield a wrong result, whereas if I had said (\d{6})? in a string say for example:

preg_match("/shu\.(\d{6})?/", "shu.321456")

this will yield true and so will the string "shu." without any int after the period

sjngm
  • 12,423
  • 14
  • 84
  • 114
pythonian29033
  • 5,148
  • 5
  • 31
  • 56