Why do some regex engines match .* twice in a single input string?

Question

Many regex engines match .* twice in a single-line string, e.g., when performing regex-based string replacement:

The 1st match is - by definition - the entire (single-line) string, as expected.
In many engines there is a 2nd match, namely the empty string; that is, even though the 1st match has consumed the entire input string, .* is matched again, which then matches the empty string at the end of the input string.
- Note: To ensure that only one match is found, use ^.*
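
To make the two matches visible as positions rather than as replacement output, here is a minimal Python sketch. It assumes Python 3.7 or later; `match_spans` is a hypothetical helper name, not part of any library.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for every match re.finditer reports."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

# Unanchored: the whole string, then an empty match at the end position.
print(match_spans(r".*", "a"))
# Anchored: ^ cannot match at index 1, so no second match is found.
print(match_spans(r"^.*", "a"))
```

The second tuple in the first result has identical start and end indices, which is the empty match at the end of the input.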

My questions are:

Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.
Other than trial and error, can you glean from the documentation / regex dialect/standard supported which engines exhibit this behavior?

Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.

Languages/platforms that DO exhibit the behavior:

 # .NET, via PowerShell (behavior also applies to the -replace operator)
 PS> [regex]::Replace('a', '.*', '[$&]')
 [a][]  # !! Note the *2* matches, first the whole string, then the empty string

 # Node.js
 $ node -pe "'a'.replace(/.*/g, '[$&]')"
 [a][]

 # Ruby
 $ ruby -e "puts 'a'.gsub(/.*/, '[\\0]')"
 [a][]

 # Python 3.7+ only
 $ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
 [a][] 

 # Perl 5
 $ echo a | perl -ple 's/.*/[$&]/g'
 [a][] 

 # Perl 6
 $ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
 [a][]

 # Others?

Languages/platforms that do NOT exhibit the behavior:

# Python 2.x and Python 3.x <= 3.6
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a]  # !! Only 1 match found.

# Others?
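
The older Python behavior can be approximated on a current interpreter. The sketch below assumes Python 3.7+ and assumes that the pre-3.7 rule amounts to "ignore an empty match that starts exactly where the previous match ended"; `sub_skip_adjacent_empty` is a hypothetical name used only for illustration.

```python
import re

def sub_skip_adjacent_empty(pattern, repl, text):
    """Like re.sub, but drop an empty match adjacent to the previous match."""
    out = []
    last_end = 0           # end of the input copied to the output so far
    prev_match_end = None  # end of the most recent accepted match
    for m in re.finditer(pattern, text):
        is_empty = m.start() == m.end()
        if is_empty and m.start() == prev_match_end:
            continue  # skip: empty match right after the previous match
        out.append(text[last_end:m.start()])  # unmatched text in between
        out.append(m.expand(repl))            # expand \g<0> etc. in the template
        last_end = prev_match_end = m.end()
    out.append(text[last_end:])
    return "".join(out)

print(sub_skip_adjacent_empty(r".*", r"[\g<0>]", "a"))
print(sub_skip_adjacent_empty(r"\d*", r"[\g<0>]", "a1"))
```

Under these assumptions the first call produces a single bracketed match, and the second keeps the empty match at the start of the string while dropping the one at the end.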

bobble bubble brings up some good related points:

If you make it lazy like .*?, you'd even get 3 matches in some and 2 matches in others. Same with .??. As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems ^.*? gives two matches in PCRE for a, whereas ^.* should result in one match everywhere.

Here's a PowerShell snippet for testing the behavior across languages, with multiple regexes:

Note: Assumes that Python 3.x is available as python3 and Perl 6 as perl6.
You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.

& {
  param($inputStr, $regexes)

  # Define the commands as script blocks.
  # IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
  #            Always use "..." as the outer quoting, to work around PS quirks.
  $cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
          { node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
          { ruby -e "puts '$inputStr'.gsub(/$regex/, '[\\0]')" },
          { python -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
          { python3 -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
          { "$inputStr" | perl -ple "s/$regex/[$&]/g" },
          { "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }

  $regexes | foreach {
    $regex = $_
    Write-Verbose -vb "----------- '$regex'"
    $cmds | foreach { 
      $cmd = $_.ToString().Trim()
      Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '\|')[-1].Trim() -split '[ :]')[0], 
                                           $cmd -replace '\$inputStr\b', $inputStr -replace '\$regex\b', $regex)
      & $_ $regex
    }
  }

} -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'

Sample output for regex ^.*, which confirms bobble bubble's expectation that using the start anchor (^) yields only one match in all languages:

VERBOSE: ----------- '^.*'
VERBOSE: [regex]   : [regex]::Replace("a", "^.*", '[$&]')
[a]
VERBOSE: node      : node -pe "'a'.replace(/^.*/g, '[$&]')"
[a]
VERBOSE: ruby      : ruby -e "puts 'a'.gsub(/^.*/, '[\\0]')"
[a]
VERBOSE: python    : python -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: python3   : python3 -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: perl      : "a" | perl -ple "s/^.*/[$&]/g"
[a]
VERBOSE: perl6     : "a" | perl6 -pe "s:g/^.*/[$/]/"
[a]

Interesting question; One thing to think about, assume you've got an empty string and the regex `.*`, would you expect a true or a false for a match-test? — tkausl, Sep 16 '18 at 05:31
` String "test" = "test" + "" ` that is why it is matching at the end — The Scientific Method, Sep 16 '18 at 05:41
@tkausl: I definitely expect `.*` to match the empty string, and all the languages mentioned in the question do - but they (sensibly) only find _one_ match in that case. — mklement0, Sep 16 '18 at 05:45
@TheScientificMethod: Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string? And if you do, shouldn't this result in an infinite loop? By this logic there is always an empty string left. — mklement0, Sep 16 '18 at 05:53

score 6 · Accepted Answer · edited Mar 25 '21 at 23:03

Kinda interesting question. Instead of referring to your questions first, I'll go for your comment.

Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string?

What is left is a position: the end of the subject string. It is a position, and positions can be matched. Like zero-width assertions and anchors (\b, \B, ^, $, ...), a dot-star .* can match the empty string. This is highly dependent on the regex engine; e.g., TRegEx does it differently.

And if you do, shouldn't this result in an infinite loop?

No, handling this is one of the main jobs of a regex engine. Engines raise a flag and store the current cursor data to prevent such loops from occurring. The Perl docs explain it this way:

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
The o? matches at the beginning of foo, and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier. Another common way to create a similar cycle is with the looping modifier /g...

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{} , and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring.

Now back to your questions:

Is there a good reason for this behavior?

Yes, there is. Every regex engine has to meet a significant number of challenges in order to process text, one of which is dealing with zero-length matches. Your question raises another question:

Q: How should an engine proceed after matching a zero-length string?

A: It all depends.

PCRE (or Ruby here) doesn't skip zero-length matches.

It matches the zero-length string, then raises a flag so as not to match at the same position again with the same pattern. In PCRE, .* matches the entire subject string and then stops right after it. The current position, at the end, is a meaningful position in PCRE: positions can be matched or asserted, so there is still a position (a zero-length string) left to be matched. PCRE goes through the regex again (if the g modifier is enabled) and finds a match at the end of the subject.

PCRE then tries to advance to the next position to run the whole process again, but fails, since there is no position left.
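
As a rough model of the global-matching loop described above, here is a Python sketch. It is not PCRE's actual implementation: it assumes the simplest possible strategy, namely bumping the cursor by one after an empty match, and `global_matches` is a hypothetical name.

```python
import re

def global_matches(pattern, text):
    """Collect match spans the way a simple 'g'-style loop might."""
    rx = re.compile(pattern)
    spans = []
    pos = 0
    # Note the <=: the end-of-string position is itself a valid place to try.
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() == m.start():
            pos = m.end() + 1  # empty match: step forward to avoid looping forever
        else:
            pos = m.end()      # non-empty match: resume where it ended
    return spans

print(global_matches(r".*", "a"))   # whole string, then the end position
print(global_matches(r"^.*", "a"))  # the anchor rules out the end position
print(global_matches(r".*", ""))    # empty input: exactly one empty match
```

In this model the second match comes from the loop condition: after consuming `a`, the cursor sits at index 1, which is still within bounds, so one more attempt is made there.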

So, if you want to prevent the second match from happening, you need to tell the engine in some way:

^.*

Or to provide a better insight into what is going on:

(?!$).*

See the live demo here; in particular, take a look at the debugger window.
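
For readers without access to the demo, the effect of the two patterns can be checked locally. This is a small sketch assuming Python 3.7+ (where `re.sub` shows the two-match behavior); `bracket_matches` is a hypothetical helper.

```python
import re

def bracket_matches(pattern, text):
    """Wrap every match that re.sub finds in square brackets."""
    return re.sub(pattern, r"[\g<0>]", text)

for pattern in (r".*", r"^.*", r"(?!$).*"):
    print(f"{pattern!r:12} -> {bracket_matches(pattern, 'a')}")
```

One difference between the two fixes: `(?!$).*` refuses to match at the end position at all, so it also finds no match in an empty input string, whereas `^.*` still matches the empty string once.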

edited Mar 25 '21 at 23:03 by mklement0
answered Sep 16 '18 at 09:28 by revo

Thanks for the interesting background info, @revo, especially the regular-expressions.info link about zero-length matches. As an aside re anchors: Python 2.x and 3.x don't match `.*\b` at all; all others (from my question and on regex101.com) still yield _two_ matches, except for Ruby, which yields _one_. That you would still get _two_ matches with `.*$` seems especially counter-intuitive. – mklement0 Sep 16 '18 at 17:12
So the fact that another match is attempted even at the _end of subject string_ position is the crux - which is not tied to whether zero-length matches are skipped altogether, as the following Python 2.x example shows: `python -c "import re; print(re.sub('\d*', '[\g<0>]', 'a1'))"` -> `[]a[1]`; i.e., a zero-width match was found at the start but not the end. All languages from my question and the additional ones on regex101.com (PCRE, golang) _do_ report zero-length matches in principle by default. Among them, only Python 2.x and Python 3.6- do _not_ attempt to match again at the end. – mklement0 Sep 16 '18 at 17:41
Can I suggest you restructure your answer to put the end-of-subject-string information first, followed by the more general information about zero-width matches? I'll gladly accept your answer then. Personally, the part I'm still unclear on is the _design rationale_ for this behavior: when would it ever be _useful_ to report an empty match _after_ the input string has been fully matched? It seems to me that when you're looking for repeated, non-overlapping matches, the more sensible approach is to not even _attempt_ another match in that case. – mklement0 Sep 16 '18 at 17:57
@mklement0 I'm going to reply to your comments in order. 1) On which input string did you expect `.*\b` to yield two matches in Python? Why is having two matches with `.*$` counter-intuitive? 2) I don't have Python 2.x, but regex101 shows a match at the end (however, they may be running another version). This behavior is all engine-related. There is no standard, no rule of thumb. It's possible that POSIX has one, but I'm not sure. 3) Edited. In fact, the design rationale corresponds to all the other behaviors: zero-length matching, ... – revo Sep 16 '18 at 19:22
@mklement0 ... type of advancing from zero-length matches, preventing from all those loops... and other peculiarities. – revo Sep 16 '18 at 19:22
Thanks for updating; Re (1) any non-empty string, such as (a) - .NET, Node.js, and Perl 5 all return two matches. Re (2) It is counter-intuitive, because by ending your regex with `$` you're asking the regex to match _at the very end_ (if single-line), so you wouldn't expect _another_ match. As stated, Python behavior changed in 3.7. Re (3) Given that the _first_ match is _not_ zero-length, the advancement/infinite loop behavior only comes into play _once another match has been attempted_, and my surprise is all about why another match is even attempted in this scenario. – mklement0 Sep 16 '18 at 19:56
@mklement0 You're welcome. 1) `.*\b` against `a` - in a flavor that supports `\b` - should have a match (or two) regardless of regex flavor. Could you provide a live demo for when it fails? 2) `$` doesn't consume anything. It's an assertion. At first, `.*$` matches the whole input string: without `$`, a dot-star consumes every character up to the end, so adding that anchor doesn't change anything. Then the same happens for the end of the subject string. The point is that the `$` anchor doesn't consume. E.g. what do you expect from running `$$$`? 3) Because of the global modifier? – revo Sep 17 '18 at 03:11
Re (1) I don't have an online repro, but you can see the problem with `python -c "import re; print(re.sub('.*\b', '[\g<0>]', 'a'))"`, both with Python 2.7.15 and Python 3.7.0. Re (2) and (3) It's not about _consuming_ anything, it's about _continuing to match_ even after something that by definition matched _the very end_ of the string - see https://stackoverflow.com/q/52369618/45375 – mklement0 Sep 17 '18 at 14:05
regex101 Python mystery solved: they don't actually use Python, but an - imperfect - _emulation of Python on top of PCRE_ - see https://github.com/firasdib/Regex101/wiki/FAQ#how-close-does-regex101-emulates-the-engines – mklement0 Sep 28 '18 at 22:22
