-2

I am using python3 re module - I find that a*a matches aaa. I thought that regex was greedy by default (unless we override it to be lazy with ?) - so, a* would have matched the entire string, and the trailing a in the pattern would fail. However, it matches:

$ import re
$ re.match(r'a*a', 'aaa')
<_sre.SRE_Match object; span=(0, 3), match='aaa'>

Should this not fail?

Anand
  • 3,690
  • 4
  • 33
  • 64
  • 3
    greedy doesn't mean that it _must_ swallow everything it can, it only means it _prefers to_. – tkausl Sep 09 '18 at 03:37
  • 1
    Recommended research: "Backtracking" in regular expressions – Ben Voigt Sep 09 '18 at 03:40
  • 1
    Your expectation is called a possessive match. Regarding to your engine this feature may or may not be available. In PCRE `a*+a` wouldn't match `aaa` because `a*+` consumes every `a` possessively (doesn't backtrack). But I marked two questions on greediness and backtracking that will help you. – revo Sep 09 '18 at 04:21

2 Answers2

5

It does initially attempt to match the entire string, but repetition will backtrack if a match fails. After the a* initially matching the entire string, the regex tries to match the next token, the single a This fails, so then the a* backtracks back a character (so that it only matches aa rather than aaa). This time, the last token, the single a, is fulfilled, so a match is found.

Greedyness doesn't mean that the regular expression will only match if the repeated token is allowed to match the whole rest of the string. It will if it can, but it'll backtrack if it can't.

Even if greedy repetition with * backtracks down to zero length, there's no problem, because * means match zero or more times. (by contrast, repeating with +, if it backtracks down to zero length, the regular expression will fail completely, because + means that at least one repetition is required)

CertainPerformance
  • 356,069
  • 52
  • 309
  • 320
0

a* means zero or more "a" characters. a*a means "zero or more "a" characters followed by an "a". "aaa" is indeed "zero or more a characters followed by an "a" since "aa" meets the criteria of "zero or more".

Bryan Oakley
  • 370,779
  • 53
  • 539
  • 685