25

Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.

Most regex engines:

  • accept a regex that explicitly tries to match an expression after the end of the input string[1].

    $ python -c "import re; print(re.findall('$.*', 'a'))"
    [''] # !! Matched the hypothetical empty string after the end of 'a'
    
  • when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:

    $ python -c "import re; print(re.findall('.*$', 'a'))"
    ['a', ''] # !! Matched both the full input AND the hypothetical empty string
    

Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the engine reports zero-length matches, either by default or because it is configured to do so).
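To make the second behavior easier to see, here is a minimal sketch that prints the span of each match instead of only the matched text. It uses only the standard `re` module; the variable name `matches` is just for illustration.

```python
import re

# List every match of '.*$' in 'a' together with its (start, end) span.
matches = [(m.group(0), m.span()) for m in re.finditer(r'.*$', 'a')]

# The second entry is the zero-length match that starts and ends at
# index 1, i.e. at the position just after the last character.
print(matches)
```

The spans suggest that the two matches do not overlap in the engine's view: the first covers index 0 to 1, the second is an empty match located exactly at index 1.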

These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:

  • it's not obvious what the benefit of this behavior is.
  • conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
    • To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)?
    • The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)

Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.

Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.

By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.


[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can also match just before a single trailing newline at the end of the input. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.

[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].
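A small sketch to check which of the two replacement behaviors a given Python installation shows. It passes a callable as the replacement instead of the `\g<0>` template purely to sidestep shell and string escaping; that substitution is my choice, not part of the original command.

```python
import re
import sys

# Wrap every match of '.*$' in [...]; the callable receives each match object.
result = re.sub(r'.*$', lambda m: '[' + m.group(0) + ']', 'a')

# Per the footnote above: '[a][]' is expected on Python 3.7 and later,
# '[a]' on earlier versions.
print(sys.version_info[:2], result)
```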

[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* to prevent multiple matches from being found via start-of-input anchoring.
(a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences; consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
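For comparison, the same two workarounds can be sketched in Python, where option (a) is available through the `count` parameter of `re.sub`. The list `items` and the result names are illustrative only, and the doubled quotes in the naive case assume Python 3.7 or later.

```python
import re

items = ['a', 'b']

# Naive global replacement: on Python 3.7+ the extra empty match at the
# end of each string is replaced too, which appends a stray "" pair.
naive = [re.sub(r'.*', r'"\g<0>"', s) for s in items]

# Option (a): allow at most one replacement per string.
limited = [re.sub(r'.*', r'"\g<0>"', s, count=1) for s in items]

# Option (b): anchor at the start of the input so no second match can start.
anchored = [re.sub(r'^.*', r'"\g<0>"', s) for s in items]

print(naive, limited, anchored)
```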

ivan_pozdeev
mklement0
  • 1
    The point here is that empty string (zero-length) matches are treated differently in different regex flavors, as the behavior is not standardized; everyone solves it their own way. There is a very good reason for that as when you get an empty string match, you might still match the next char that is still at the same index in the string. If a regex engine did not support it, these matches would be skipped. Making an exception for the end of string was not probably that critical for regex engine authors. – Wiktor Stribiżew Sep 17 '18 at 14:27
  • Consider the difference between `$.` and `$.*` – dawg Sep 17 '18 at 15:20
  • @dawg: With single-line input, `$.` _never_ matches anything, `$.*` _always_ matches something, namely the empty string. With multi-line input, as noted, many engines interpret `$` as the last char. _before a single trailing newline_, so if `.` is configured to match `\n` too, `$.` would match that trailing `\n`. However, the single-line behavior does apply if you use the true, unconditional end-of-input anchor, such as `\z` for .NET. Given all that, what is your example meant to illustrate? – mklement0 Sep 17 '18 at 18:07
  • @WiktorStribiżew: So far, your statement that _Making an exception for the end of string was not probably that critical for regex engine authors_ comes closest to the _why_ that I'm looking for. However, matching at the end of the string is _not_ limited to engines that continue matching at the _same_ index after an empty match; if you're up for it, please review [my own answer](https://stackoverflow.com/a/52389660/45375) for accuracy. – mklement0 Sep 18 '18 at 20:07

6 Answers

7

I am giving this answer just to demonstrate why a regex would want to allow any code appearing after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:

  • starts with three numbers
  • followed by one or more letters, numbers, hyphen, or underscore
  • ends with only letters and numbers

We could write the following pattern:

^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$

But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:

^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)

or

^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$

Here, we eliminated one of the character classes, and instead used a negative lookbehind after the $ anchor to assert that the final character was not underscore or hyphen.

Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
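As a quick check that the lookbehind-after-`$` form behaves like the two-character-class form, here is a minimal sketch in Python. The sample strings and variable names are made up for illustration; Python accepts `(?<!_|-)` because both alternatives have the same width.

```python
import re

# The "bulky" version with two adjacent character classes.
two_classes = re.compile(r'^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$')

# The version with a negative lookbehind placed after the $ anchor.
lookbehind_after_anchor = re.compile(r'^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)')

samples = ['123abc', '123a-b', '123a_', '123-', '12abc', '123']

# For each sample, record whether each pattern accepts it.
results = [
    (s, bool(two_classes.search(s)), bool(lookbehind_after_anchor.search(s)))
    for s in samples
]
for row in results:
    print(row)
```

On these samples both patterns accept only `123abc` and `123a-b`, which is consistent with the lookbehind still being evaluated after `$` has matched.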

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • 1
    You are mixing up notions. `$` only asserts the position at the end of the string (or before the trailing newline in most engines) and the `(?<!_|-)` is a lookbehind that checks the text *before* the end of the string. This has nothing to do with the point that the end of string position can be matched twice. – Wiktor Stribiżew Sep 17 '18 at 14:49
  • @WiktorStribiżew: Tim's answer is a helpful response to the _first_ question in my question post; in hindsight I should have created two separate question posts. – mklement0 Sep 26 '18 at 02:32
4

Recall several things:

  1. ^ and $ are zero width assertions - they match right after the logical start of the string (or after each line ending in multiline mode with the m flag in most regex implementations) or at the logical end of string (or end of line BEFORE the end of line character or characters in multiline mode.)

  2. .* can produce a zero-length match, i.e. a match of nothing at all. The zero-length-only version would be $(?:end of line){0} (DEMO) (which is useful as a comment, I guess...)

  3. . does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings. So $.{1} only matches Windows line endings for example (but don't do that. Use the literal \r\n instead.)

There is no particular benefit other than simple side effect cases.

  1. The regex $ is useful;
  2. .* is useful.
  3. The regex ^(?a lookahead) and (?a lookbehind)$ are common and useful.
  4. The regex (?a lookaround)^ or $(?a lookaround) are potentially useful.
  5. The regex $.* is not useful and rare enough to not warrant implementing some optimization to have the engine stop looking with that edge case. Most regex engines do a decent job of parsing syntax; a missing brace or parenthesis for example. To have the engine parse $.* as not useful would require parsing meaning of that regex as different than $(something else)
  6. What you get will be highly dependent on the regex flavor and the status of the s and m flags.
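To illustrate the single-line case with Python's `re` module (one flavor only, so other engines may differ as point 6 says): `$.` can never match because no character follows the end of the input, whereas `$.*` always matches the empty string there. `\Z` is Python's unconditional end-of-string anchor.

```python
import re

text = 'a'

# '.' needs one character after the end of the input, so nothing matches.
dollar_dot = re.findall(r'$.', text)

# '.*' is satisfied by zero characters, so the empty string at the end matches.
dollar_dot_star = re.findall(r'$.*', text)

# Same result with the unconditional end-of-string anchor.
end_anchor_dot_star = re.findall(r'\Z.*', text)

print(dollar_dot, dollar_dot_star, end_anchor_dot_star)
```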

For examples of replacements, consider the following Bash script output from some major regex flavors:

#!/bin/bash

echo "perl"
printf  "123\r\n" | perl -lnE 'say if s/$.*/X/mg' | od -c
echo "sed"
printf  "123\r\n" | sed -E 's/$.*/X/g' | od -c
echo "python"
printf  "123\r\n" | python -c "import re, sys; print re.sub(r'$.*', 'X', sys.stdin.read(),flags=re.M) " | od -c
echo "awk"
printf  "123\r\n" | awk '{gsub(/$.*/,"X")};1' | od -c
echo "ruby"
printf  "123\r\n" | ruby -lne 's=$_.gsub(/$.*/,"X"); print s' | od -c

Prints:

perl
0000000    X   X   2   X   3   X  \r   X  \n                            
0000011
sed
0000000    1   2   3  \r   X  \n              
0000006
python
0000000    1   2   3  \r   X  \n   X  \n                                
0000010
awk
0000000    1   2   3  \r   X  \n                                        
0000006
ruby
0000000    1   2   3   X  \n                                            
0000005
dawg
  • Re (3) For the purpose of this discussion, assume _single-line_ input only, the presence of newlines (multi-line) introduces variations in behavior of `$` and `.`, as you state, but these are _incidental_ to my questions - I've added this clarification to the question too. – mklement0 Sep 17 '18 at 18:54
  • 1
    Otherwise, that's all good information, but the question I'm looking to have answered is: With global matching, if a regex has by definition consumed the _entire_ input with the 1st match, why would the engine _continue to look_ for more matches? – mklement0 Sep 17 '18 at 19:03
  • I changed the multiline input to single line. re *why would the engine continue to look for more matches?* Because the definition of `$` alone is useful and the definition of `.*` alone is useful. The regex `$.*` is not useful. For a regex engine designer to come up with different behavior for that regex or optimize it out is probably not worth the effort. – dawg Sep 17 '18 at 19:18
  • Point taken re `$.*`, but my previous comments were about `.*` / `.*$` with _global_ matching / replacement (sorry, should have made that clear - in hindsight I should have asked two separate questions). – mklement0 Sep 17 '18 at 19:22
  • +1, but could you please add your previous comment to the top of your answer (saying that ruling out nonsensical `$` regexes probably wasn't worth the implementation effort)? – mklement0 Sep 17 '18 at 19:27
  • `. does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings.` Do note that this is not universally true. `.` does not match `\n` in most language (unless it uses `s` flag by default). However, `.` does not match `\r` in many languages, JavaScript and Java are the 2 on top of my mind right now, probably PCRE as well. – nhahtdh Sep 21 '18 at 03:17
  • @nhahtdh: Please try `echo -e '\r' | perl -lane 'print "yes" if /./'` to see that the default in Perl is to match `\r` – dawg Sep 21 '18 at 14:11
  • @dawg: I only say that it's not universally true. I don't doubt that there are languages where `.` matches `\r`. – nhahtdh Sep 24 '18 at 02:17
  • @nhahtdh: 4 out of 5 of the examples `.` matches `\r`. Ruby is the only one that does not. Do you have more examples of regex engines where `.` does not match `\r`? – dawg Sep 24 '18 at 03:24
3

What is the reason for using .* with the global modifier on? Either someone somehow expects an empty string to be returned as a match, or he / she isn't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.

it's not obvious what the benefit of this behavior is.

There shouldn't be a benefit. Actually you are questioning zero-length matches existence. You are asking why does a zero-length string exist?

We have three valid places that a zero-length string exists:

  • Start of subject string
  • Between two characters
  • End of subject string

We should look for the reason, rather than the benefit, of that second zero-length match produced by .* with the g modifier (or a function that searches for all occurrences). That zero-length position following an input string has some logical uses. The state diagram below is taken from Debuggex for .*, but I added an epsilon on the direct transition from the start state to the accept state to demonstrate a definition:

[Image: state diagram for .* with an added epsilon transition from the start state to the accept state]
(source: pbrd.co)

That's a zero-length match (read more about epsilon transition).

This all relates to greediness and non-greediness. Without zero-length positions, a regex like .?? wouldn't have a meaning. It doesn't attempt the dot first; it skips it. For this purpose it matches a zero-length string, to move from the current state to a temporary acceptable state.

Without a zero-length position, .?? could never skip a character in the input string, and that would result in a whole new flavor.

The definition of greediness / laziness leads to zero-length matches.
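A minimal sketch of the greedy versus lazy difference, using Python's `re` module as one example flavor. The variable names are illustrative.

```python
import re

# Greedy: '.?' tries the dot first and consumes the character.
greedy = re.match(r'.?', 'a')

# Lazy: '.??' tries the zero-length option first and is satisfied with it.
lazy = re.match(r'.??', 'a')

# When the whole string must be matched, the lazy quantifier is forced
# to fall back to consuming the character after all.
forced = re.fullmatch(r'.??', 'a')

print(greedy.span(), lazy.span(), forced.span())
```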

Glorfindel
revo
  • _What is the reason behind using .* with global modifier on?_ As stated in footnote [3] in the question, in some languages, regex-based features _by default and invariably_ use _global_ matching, so while if you had to _choose_ you wouldn't _opt-into_ global matching, you sometimes can't _opt-out_. – mklement0 Sep 17 '18 at 18:49
  • Matching globally is an extra feature that can be enabled in almost all languages. It doesn't happen invariably; even in PowerShell there would be an option or a separate function that works the other way. – revo Sep 17 '18 at 18:55
  • Yes, it happens _invariably_ with `-replace` in PowerShell (and while you can _work around that_ with direct use of the `Regex` type, that is beside the point). All that aside, my question stands: with global matching, if a regex has by definition consumed the _entire_ input, why would it _continue to look_ for more matches? As such, my question is NOT _why does a zero-length string exist?_ – mklement0 Sep 17 '18 at 18:58
  • 1
    *if a regex has by definition consumed the entire input, why would it continue to look for more matches?* there are two reasons: 1) there exists zero-length positions. 2) you have global modifier on. – revo Sep 17 '18 at 19:00
  • Re 2): Using the global modifier is the _premise_ of my question. In other words: _With the global modifier in effect_, why does it behave this way? Re 1): That the end-of-subject-string position arrived at _after_ a _nonempty first match_ would be considered as requiring _another_ match attempt is what doesn't make sense to me. The nonempty first match would normally result in unconditional advancing to the next char., which is obviously impossible at the end of the string. You could therefore consider the 2nd, empty match to be _overlapping_ with the first, violating global matching rules. – mklement0 Sep 17 '18 at 19:12
  • *With the global modifier in effect, why does it behave this way?* global modifier doesn't have any effect on regex itself. It just makes engine to continue matching against the regex. Hence `.*` matches end of subject string position. A zero-length position never overlaps until its position is matched and the state is saved. – revo Sep 17 '18 at 19:19
  • @mklement0 Why do you expect `.*` to match against an empty string? 1) Because input string does have no characters and `.*` could match nothing. 2) It's end of subject string and `.*` matches zero-length positions. 3) It's start of subject string and `.*` matches zero-length positions. 4) All three. – revo Sep 17 '18 at 19:27
  • `.*` does and should match the empty string - no argument there. My question is unrelated to empty strings. My question is why functionality designed to find multiple, non-overlapping matches of a regex decides to even _attempt_ another match if it knows that _the entire input has been consumed already_, irrespective of what the regex is (you'll never see the symptom with a regex that doesn't also match the empty string, however). – mklement0 Sep 17 '18 at 19:32
  • @mklement0 No, unfortunately you didn't get the point of that question. Anyhow, in all those regex engines that return one extra match, the greedy quantifier didn't consume the end-of-subject-string position. It's left there. There is a helpful article here you may want to read https://www.dyalog.com/blog/2015/02/zero-length-regular-expression-matches-considered-harmful/ – revo Sep 17 '18 at 19:38
  • The article doesn't contain anything not already discussed. We've established that "Searching then resumes" is the _de facto_ behavior. While this makes sense in the context of _alternation_, it doesn't in the context of _global matching_ - and we still don't know why the latter behavior was chosen. – mklement0 Sep 17 '18 at 19:52
  • The article starts to give a reason where it says *"If it seems odd..."*. And I'd like to know what your understanding of *global matching* is. – revo Sep 17 '18 at 19:58
  • Yes, the article states that, but, like your answer, it simply documents the existing behavior from a technical standpoint without giving a _rationale_ for it. By _global_ I mean finding all non-overlapping matches of a given regex. The current behavior is justifiable if that regex is an _alternation_ that _also_ matches the empty string, because then a _single instance of matching_ can yield multiple matches. By contrast, I expect the conceptually _separate_ logic related to _sequencing_ matches to never even _attempt_ another match once all input has been consumed. – mklement0 Sep 17 '18 at 20:19
  • @mklement0 You *guess* it overlaps. Your guess is wrong. In fact nothing overlaps. `.*` matches up to the end. Engine is satisfied. `g` is set. Engine continues. There is nothing left but a position. Engine tries to match it against `.*`. `.*` matches. Engine is satisfied. `g` is set. Engine continues. There is no position left. Engine stops. – revo Sep 17 '18 at 20:32
  • 1
    @mklement0 Also according to your own reasoning I want to know what do you expect from alternation if we switch sides `'a' -replace '.*|a', '[$&]'`? Is it defensible yet? (`a` never matches) – revo Sep 17 '18 at 20:40
  • I've cleaned up some of my comments, because I need to rethink some of my conclusions. Thanks for an interesting discussion so far; +1 for your answer. As an aside re _Engine continues. There is no position left. Engine stops_: Is there such a thing as _no position left_? Or is this merely the case of the engine explicitly preventing finding an empty match _again_ at the (unchanged) end-of-subject-string position? – mklement0 Sep 18 '18 at 02:25
  • It could be both. But in this case it depends on the engine. The latter *preventing finding an empty match again* would be the one that probably some engines look for. – revo Sep 18 '18 at 04:45
  • I've posted [an answer of my own](https://stackoverflow.com/a/52389660/45375) (which references yours) - if you're up for it, I'd appreciate your input on it with respect to technical accuracy. – mklement0 Sep 18 '18 at 20:09
2

Note:

  • My question post contains two related, but distinct questions, for which I should have created separate posts, as I now realize.
  • The other answers here focus on one of the questions each, so in part this answer provides a road map to which answers address which question.

As for why patterns such as $<expr> are allowed (i.e., matching something after the input's end) / when they make sense:

  • dawg's answer argues that nonsensical combinations such as $.+ probably aren't prevented for pragmatic reasons; ruling them out may not be worth the effort.

  • Tim's answer shows how certain expressions can make sense after $, namely negative lookbehind assertions.

  • The second half of ivan_pozdeev's answer cogently synthesizes dawg's and Tim's answers.


As for why global matching finds two matches for patterns such as .* and .*$:

  • revo's answer contains great background information about zero-length (empty-string) matching, which is what the problem ultimately comes down to.

Let me complement his answer by relating it more directly to how the behavior contradicts my expectations in the context of global matching:

  • From a purely common-sense perspective, it stands to reason that once the input has been fully consumed while matching, there is by definition nothing left, so there is no reason to look for further matches.

  • By contrast, most regex engines consider the character position after the last character of the input string - the position known as end of subject string in some engines - a valid starting position for a match and therefore attempt another one.

    • If the regex at hand happens to match the empty string (produces a zero-length match; e.g., regexes such as .*, or a?), it matches that position and returns an empty-string match.

    • Conversely, you won't see an extra match if the regex doesn't (also) match the empty string - while the additional match is still attempted in all cases, no match will be found in this case, given that the empty string is the only possible match at the end-of-subject-string position.
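The mechanics described in the bullets above can be sketched as a toy global-matching loop. This is not how any particular engine is implemented; `global_matches` and its `stop_when_consumed` flag are hypothetical names, the loop uses the simple "advance one character after an empty match" strategy, and the flag models the behavior I would have expected instead.

```python
import re

def global_matches(pattern, text, stop_when_consumed=False):
    """Toy loop returning the (start, end) span of every match in text."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    # Valid start positions run from 0 through len(text) inclusive;
    # len(text) is the end-of-subject-string position.
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.start() == m.end():
            # Empty match: step forward one character to avoid looping forever.
            pos = m.end() + 1
        else:
            pos = m.end()
            # Hypothetical rule: stop once a non-empty match consumed everything.
            if stop_when_consumed and pos == len(text):
                break
    return spans

print(global_matches(r'.*', 'a'))                           # de facto behavior
print(global_matches(r'.*', 'a', stop_when_consumed=True))  # expected behavior
print(global_matches(r'a', 'a'))                            # cannot match empty
```

With `.*` the loop reports a second, empty match at index 1 unless the hypothetical flag is set; with `a` the attempt at index 1 is still made but finds nothing.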

While this provides a technical explanation of the behavior, it still doesn't tell us why matching after the last character was implemented.

The closest thing we have is an educated guess by Wiktor Stribiżew in a comment (emphasis added), which again suggests a pragmatic reason for the behavior:

... as when you get an empty string match, you might still match the next char that is still at the same index in the string. If a regex engine did not support it, these matches would be skipped. Making an exception for the end of string was probably not that critical for regex engine authors.

The first half of ivan_pozdeev's answer explains the behavior in more technical detail by telling us that the void at the end of the [input] string is a valid position for matching, just like any other character-boundary position.
However, while treating all such positions the same is certainly internally consistent and presumably simplifies the implementation, the behavior still defies common sense and has no obvious benefit to the user.


Further observations re empty-string matching:

Note: In all code snippets below, global string replacement is performed to highlight the resulting matches: each match is enclosed in [...], whereas non-matching parts of the input are passed through as-is.

In summary, 3 different, independent behaviors apply in the context of empty(-string) matches, and different engines use different combinations:

  • Whether the POSIX ERE spec's longest leftmost rule is obeyed. (Thanks, revo.)

  • In global matching:

    • Whether or not the character position is advanced after an empty match.
    • Whether or not another match is attempted for the by-definition empty string at the very end of the input (the 2nd question in my question post).

Matching at the end-of-subject-string position is not limited to those engines where matching continues at the same character position after an empty match.

For instance, the .NET regex engine does not do so (PowerShell example):

PS> 'a1' -replace '\d*|a', '[$&]'
[]a[1][]

That is:

  • \d* matched the empty string before a
  • a itself then did not match, which implies that the character position was advanced after the empty match.
  • 1 was matched by \d*
  • The end-of-subject-string position was again matched by \d*, resulting in another empty-string match.

Perl 5 is an example of an engine that does resume matching at the same character position:

$ echo a1 | perl -ple 's/\d*|a/[$&]/g'
[][a][1][]

Note how a was matched too.
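For comparison, the same replacement can be tried in Python. Based on the regular-expressions.info description that Python 3.7 and later handle zero-length matches like Perl, the result on those versions should equal the Perl 5 output above; I am not asserting anything about earlier versions here.

```python
import re
import sys

# Wrap every match of \d*|a in [...]; a callable avoids template escaping.
result = re.sub(r'\d*|a', lambda m: '[' + m.group(0) + ']', 'a1')

print(sys.version_info[:2], result)
```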

Interestingly, Perl 6 not only behaves differently, but exhibits yet another behavior variant:

$ echo a1 | perl6 -pe 's:g/\d*|a/[$/]/'
[a][1][]

Seemingly, if an alternation finds both an empty and a non-empty match, only the non-empty one is reported.

Perl 6's behavior appears to be following the longest leftmost rule.

While sed and awk do as well, they don't attempt another match at the end of the string:

sed, both the BSD/macOS and GNU/Linux implementations:

$ echo a1 | sed -E 's/[0-9]*|a/[&]/g'
[a][1]

awk - both the BSD/macOS and GNU/Linux implementations as well as mawk:

$ echo a1 | awk '1 { gsub(/[0-9]*|a/, "[&]"); print }'
[a][1]
mklement0
  • 2
    There is a rule in regular expressions world called *leftmost longest match*. It seems Perl 6 follows it. It's a POSIX standard. Sed and awk do follow as well. `\d*` doesn't produce a match at offset `0` because `a` at the other side will produce a match longer than `\d*`. Overall, it's a good summed up answer. However some statements aren't backed by authoritative references e.g. that from dawg or the one from Wiktor Stribiżew. – revo Sep 18 '18 at 20:33
  • @revo Posix's leftmost longest match seems pretty important and should be part of an answer I think. Yet another wrinkle in computing. – js2010 May 01 '21 at 14:51
  • @js2010, I don't think the behavior necessarily contradicts the POSIX ERE spec's [leftmost longest rule](https://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html), as it applies to a _single_ act of matching. By contrast, the question at hand is about why, in _global_ matching, _another_ match is attempted, even though the string has already been consumed in full. – mklement0 May 01 '21 at 15:36
  • @revo, I've added additional examples to the bottom section, and I conclude that obeying the leftmost longest match rule is independent of the match-again-at-the-end-of-the-string behavior: All of Perl 6, `sed` and `awk` seem to obey the leftmost longest rule, but only Perl 6 (and Perl 5, which doesn't obey the leftmost longest rule) also matches again at the end of the string. – mklement0 May 01 '21 at 16:26
1

"Void at the end of the string" is a separate position for regex engines because a regex engine deals with positions between input characters:

|a|b|c|   <- input line

^ ^ ^ ^
positions at which a regex engine can "currently be"

All other positions can be described as "before Nth character" but for the end, there's no character to refer to.

As per Zero-Length Regex Matches -- Regular-expressions.info, it's also needed to support zero-length matches (which not all regex flavors support):

  • E.g. a regex \d* over string abc would match 4 times: before each letter, and at the end.

$ is allowed anywhere in the regex for uniformity: it's treated the same as any other token and matches at that magical "end of string" position. Making it "finalize" the regex work would lead to an unnecessary inconsistency in engine work and prevent other useful things that can match there, like e.g. lookbehind or \b (basically, anything that can be a zero-length match) -- i.e. would be both a design complication and a functional limitation with no benefit whatsoever.
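A short sketch of both points in Python, as one example flavor: the four zero-length matches of `\d*` over `abc`, and two zero-length tokens (`\b` and a lookbehind) that can meaningfully follow `$`. The variable names are illustrative.

```python
import re

# \d* matches the empty string before each letter and once more at the end.
spans = [m.span() for m in re.finditer(r'\d*', 'abc')]

# A word boundary after $: true only if the input ends in a word character.
ends_in_word_char = re.search(r'$\b', 'abc')
ends_in_punctuation = re.search(r'$\b', 'ab!')

# A lookbehind after $ inspects the character just before the end.
ends_in_c = re.search(r'$(?<=c)', 'abc')

print(spans)
print(ends_in_word_char.span(), ends_in_punctuation, ends_in_c.span())
```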


Finally, to answer why a regex engine may or may not try to match "again" at the same position, let's refer to Advancing After a Zero-Length Regex Match -- Zero-Length Regex Matches -- Regular-expressions.info:

Say we have the regex \d*|x, the subject string x1

The first match is a blank match at the start of the string. Now, how do we give other tokens a chance while not getting stuck in an infinite loop?

The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match

This may give counterintuitive results -- e.g. the above regex will match '' at start, 1 and '' at the end -- but not x.

The other solution, which is used by Perl, is to always start the next match attempt at the end of the previous match, regardless of whether it was zero-length or not. If it was zero-length, the engine makes note of that, as it must not allow a zero-length match at the same position.

This "skips" fewer matches, at the cost of some extra complexity. E.g. the above regex will produce '', x, 1 and '' at the end.

The article goes on to show that there aren't established best practices here and various regex engines are actively trying new approaches to try and produce more "natural" results:

One exception is the JGsoft engine. The JGsoft engine advances one character after a zero-length match, like most engines do. But it has an extra rule to skip zero-length matches at the position where the previous match ended, so you can never have a zero-length match immediately adjacent to a non-zero-length match. In our example the JGsoft engine only finds two matches: the zero-length match at the start of the string, and 1.

Python 3.6 and prior advance after zero-length matches. The gsub() function to search-and-replace skips zero-length matches at the position where the previous non-zero-length match ended, but the finditer() function returns those matches. So a search-and-replace in Python gives the same results as the Just Great Software applications, but listing all matches adds the zero-length match at the end of the string.

Python 3.7 changed all this. It handles zero-length matches like Perl. gsub() does now replace zero-length matches that are adjacent to another match. This means regular expressions that can find zero-length matches are not compatible between Python 3.7 and prior versions of Python.

PCRE 8.00 and later and PCRE2 handle zero-length matches like Perl by backtracking. They no longer advance one character after a zero-length match like PCRE 7.9 used to do.

The regexp functions in R and PHP are based on PCRE, so they avoid getting stuck on a zero-length match by backtracking like PCRE does. But the gsub() function to search-and-replace in R also skips zero-length matches at the position where the previous non-zero-length match ended, like gsub() in Python 3.6 and prior does. The other regexp functions in R and all the functions in PHP do allow zero-length matches immediately adjacent to non-zero-length matches, just like PCRE itself.
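The extra rule quoted above for the JGsoft engine (skip a zero-length match at the position where the previous match ended) can be sketched as a filter over a list of match spans. This is only an illustration of the rule as quoted, not JGsoft's actual code; `drop_adjacent_empty` is a hypothetical helper, and the input spans are written out by hand from the article's `\d*|x` over `x1` example for an engine that advances one character after an empty match.

```python
def drop_adjacent_empty(spans):
    """Drop zero-length spans that start where the previous span ended."""
    kept = []
    prev_end = None
    for start, end in spans:
        is_empty = start == end
        if is_empty and prev_end is not None and start == prev_end:
            continue  # empty match directly adjacent to the previous match
        kept.append((start, end))
        prev_end = end
    return kept

# Matches of \d*|x over 'x1' with the advance-by-one strategy, per the
# article: '' at the start, '1', and '' at the end.
advance_by_one = [(0, 0), (1, 2), (2, 2)]

print(drop_adjacent_empty(advance_by_one))
```

Applied to that example, only the empty match at the start and the match for `1` remain, which agrees with the two matches the article reports for the JGsoft engine.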

ivan_pozdeev
  • Thanks (+1); the expression-after-`$` explanation makes sense to me (in a sense it cogently synthesizes Tim's and dawg's answers). (A potential benefit of preventing non-sensical patterns is to alert the user to that fact, but I get that may not be worth it.) – mklement0 Sep 27 '18 at 13:16
  • As for the matching-again-at-the-end issue: The .NET regex engine is an example of _not_ matching at the _same_ position again after an empty match, yet it also matches again at the end of the string after the _nonempty_ match that is the premise of my question. (In fact, it is matching at the same position again after an empty match that introduces complexity, because you then need to prevent an infinite loop). Purely logically, _after the last character_ doesn't sound like _between_ characters to me, because there is no 2nd reference point. So why treat them the same? – mklement0 Sep 27 '18 at 13:20
  • @mklement0 I've found another useful case that can match at the end – ivan_pozdeev Sep 28 '18 at 02:36
  • @mklement0 _"after the last character doesn't sound like between characters to me"_ -- say it as "at character boundary" if that makes you feel better :-) – ivan_pozdeev Sep 28 '18 at 02:37
  • True, if you go by _character boundaries_ (which also includes the position _before the first_ character), then treating them all the same is _internally consistent_. But it still defies common sense with respect to deciding whether to _match again_. I get the behavior variance around how to proceed after an _empty_ match, but the premise of my (2nd) question is a _nonempty_ first match, where the major engines (.NET, Node.js, Python 2/3, Ruby, Perl, Perl 6, PCRE) act the same. (And with an _empty_ 1st match at the end, finding another is prevented by the infinite-loop prevention logic.) – mklement0 Sep 28 '18 at 13:58
  • Perhaps it really does come down to what Wiktor said: " Making an exception for the end of string was not probably that critical for regex engine authors." – mklement0 Sep 28 '18 at 14:02
  • @mklement0 "Making an exception for the end of string" would be a bug 'cuz as you've seen, there are regexes that can match at the end of the string. And for consistency/predictability/uniformity, the engine treats all regexes the same 'cuz trying to determine if a regex is a "good"-regex-for-an-end-of-string is a lost cause. – ivan_pozdeev Sep 28 '18 at 14:36
  • To be clear: I'm not talking about my _first_ question (why a regex like `$foo` is allowed): I find your answer to _that_ question satisfactory and I get that preventing users from formulating syntactically valid, but nonsensical patterns may not be worth the effort. With respect to my _second_ question: There's no need to determine anything other than whether you've reached the end of the string _after the first match_ and _stop matching, if so_ - whatever the regex pattern is. If you don't stop, you may get the additional, useless, always-empty match that prompted the (2nd) question. – mklement0 Sep 28 '18 at 14:52
  • @mklement0 The fact is, that "additional, always-empty match" is not necessarily "useless", so you can't throw it away blindly. And to throw it away non-blindly, you need to somehow define which regexes are "good" vs "bad" for end-of-string matching -- in a way users can agree with and reasonably predict. Not only would this criterion be subjective, it'd be (nigh-?)impossible to check 'cuz regexes can be infinitely complex and interconnected. – ivan_pozdeev Sep 28 '18 at 15:04
  • You don't need to consider the specifics of the pattern at all, if you follow the simple and straightforward algorithm mentioned in my previous comment: After a successful match (whatever its result), if you've reached the end of the input, stop matching. – mklement0 Sep 28 '18 at 15:12
  • @mklement0 how many times do I have to say that you can't do that because there are regexes that one would legitimately want to match both the last character(s) and the end of the string? E.g. `([^-]*)(-|$)` for `ab-c-d-` – ivan_pozdeev Sep 28 '18 at 15:33
  • Your example yields the following matches: `ab-`, `c-`, `d-`, ``. I neither expect nor would I have any use for that final, empty match. What problem do you see with omitting it or, conversely, why should it be there? – mklement0 Sep 28 '18 at 22:57
  • @mklement0 The line is a series of words separated by dashes (a series of elements divided by a separator is a common data format). The 1st group in a match contains the word. The last one is an empty word. A more illustrative example would probably be `ab-c--d-`. – ivan_pozdeev Sep 29 '18 at 01:00
  • Your revised example gives us the following matches: `ab-`, `c-`, `-`, `d-`, ``. Per match, the 1st group contains the word, or the empty string if the previous char. or start of the string was `-` too. The 2nd group contains the `-` or the empty string if the nonempty word isn't followed by a `-` at the end of the string. The final match is an empty string, overall, and both its groups are therefore empty, too. The final, all-empty match still makes no sense to me. – mklement0 Sep 29 '18 at 01:54
  • @mklement0 Hopefully this will make more sense: `[gg[0] for gg in re.findall(r'([^-]*)(-|$)', "ab-c--d-")]`. This will extract all the elements, whether empty or not. Try to match other strings if it's still unclear. – ivan_pozdeev Sep 29 '18 at 02:10
  • If we change `gg[0]` to `gg`, we see what the individual matches' 2 capture groups captured individually (concatenating the two capture-group matches will give you each match's overall value): `[('ab', '-'), ('c', '-'), ('', '-'), ('d', '-'), ('', '')]`. The last, all-empty match still doesn't make sense to me - clearly, the input has been fully consumed with the penultimate match. – mklement0 Sep 29 '18 at 02:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180971/discussion-between-ivan-pozdeev-and-mklement0). – ivan_pozdeev Sep 29 '18 at 02:24
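The two positions argued in the comments above can be compared side by side. The following is only a sketch: `find_all` and its `stop_at_end` flag are hypothetical names, not part of any regex library, and the loop is a simplified stand-in for what an engine's global matching does (it steps one character forward after an empty match, which is one of several strategies engines use). With `stop_at_end=True` it applies the rule proposed by mklement0: after a successful match that reaches the end of the input, stop matching.

```python
import re

def find_all(pattern, text, stop_at_end=False):
    """Simplified global-match loop that returns every matched substring."""
    regex = re.compile(pattern)
    matches = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        matches.append(m.group(0))
        if stop_at_end and m.end() == len(text):
            break  # proposed rule: input fully consumed, do not try again
        # Step one character forward after an empty match to avoid looping forever.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return matches

print(find_all(r"([^-]*)(-|$)", "ab-c--d-"))
# ['ab-', 'c-', '-', 'd-', '']
print(find_all(r"([^-]*)(-|$)", "ab-c--d-", stop_at_end=True))
# ['ab-', 'c-', '-', 'd-']
print(find_all(r".*$", "a"))
# ['a', '']
print(find_all(r".*$", "a", stop_at_end=True))
# ['a']
```

Whether the trailing empty match in the first and third results is a meaningful "empty last element" or a spurious extra is exactly the point of disagreement; the sketch shows only that the two rules differ in that final match and nowhere else for these inputs.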
-1

I don't know where the confusion comes from.
Regex engines are basically stupid.
They're like Mikey, they'll eat anything.

$ python -c "import re; print(re.findall('$.*', 'a'))"
[''] # !! Matched the hypothetical empty string after the end of 'a'

You could put a thousand optional expressions after $ and it will still match the
EOS. Engines are stupid.

$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', ''] # !! Matched both the full input AND the hypothetical empty string

Think of it this way, there are two independent expressions here
.* | $. The reason is the first expression is optional.
It just happens to butt against the EOS assertion.
Thus you get 2 matches on a non-empty string.

Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already,

The class of things called assertions don't exist at character positions.
They exist only BETWEEN character positions.
If they exist in the regex, you don't know if the entire input has been consumed.
If they can be satisfied as an independent step, but only once, they will match
independently.

Remember, regex is a left-to-right proposition.
Also remember, engines are stupid.
This is by design.
Each construct is a state in the engine, it's like a pipeline.
Adding complexity will surely doom it to failure.

As an aside, does .*a actually start from the beginning and check each character ?
No. .* immediately starts at the end of string (or line, depending) and starts
backtracking.
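A small illustration of that greedy behavior, as a sketch in Python (the sample string is made up): because .* first takes everything it can and then gives characters back until the following `a` fits, the match ends at the last `a` in the string, not the first.

```python
import re

text = "banana!"

# Greedy .* grabs the whole string, then backs off one character at a time
# until the trailing 'a' can match. The result ends at the LAST 'a'.
greedy = re.match(r".*a", text)

print(greedy.group())  # banana
print(greedy.span())   # (0, 6)
```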

Another funny thing. I see a lot of novices using .*? at the end of their
regex, thinking it will get all the remaining cruft from the string.
It's useless, it will never match anything but the empty string.
Even a standalone .*? regex will always match nothing, for as many characters
as there are in the string.
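A quick check of the lazy-quantifier point, as a sketch in Python with a made-up input. It uses single match attempts only, since the results of global matching with empty matches vary between Python versions.

```python
import re

text = "abc"

# A lazy quantifier at the end of a pattern is satisfied by zero characters.
trailing = re.match(r"a.*?", text).group()

# Standalone, it matches the empty string at the first position tried.
standalone = re.search(r".*?", text).group()

# It expands only when something after it forces it to, here the $ anchor.
anchored = re.match(r"a.*?$", text).group()

print(repr(trailing), repr(standalone), repr(anchored))
# 'a' '' 'abc'
```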

Good luck! Don't fret it, regex engines are just ... well, stupid.

  • Thanks, but _regex engines are stupid_ isn't a satisfying explanation. As for the find-all/replacement behavior: just `.*` by itself produces the same result, I just added the `$` to make it more obvious that I want _everything_ to be matched. Not attempting another match at what is by definition the _end of the input_ is not adding complexity. As stated, the behavior also surfaces even _without_ assertions, but why wouldn't the engine know that something that matched with `$` (at least with single-line input) has consumed _all_ of the input? `$` matches _after_ the last char., right? – mklement0 Sep 27 '18 at 01:38
  • @mklement0 - Hey bud, sorry you feel that way, was just being honest. In my defense, I did add a boatload of stuff. I think `.*$` is a good example. If `.*` can match it is on the condition of `$`, where `$` itself is not matched as a standalone item. `$` by itself, can only be matched at one place at a time. Even more bizarre, `$` can be matched before a newline _or_ after. With this target `"abc\n"`, and using `.*$` there are actually 3 matches. "`abc<1>`\n" the abc, "abc<1>`<2>`\n" before the newline, "abc<1><2>\n`<3>`" after the newline. Good luck! https://regex101.com/r/YNRSJk/1 –  Sep 28 '18 at 01:08
  • I was being honest, too - there are no hard feelings: I appreciate your efforts, it just so happens that they didn't convince me. The premise of my question was _single-line_ input, so that `$` by definition matches the very end of the input (.NET has `\z` to match the absolute end of input for multi-line input, for instance, but I didn't want to get into that). My puzzlement is still unresolved: why match (again) at the very end of the input, which is not a _between_-characters position, because _no character comes after_. – mklement0 Sep 28 '18 at 01:58
  • @mklement0 - The prime directive of engines is to never match at the same position twice. Matches are consumed based on _the starting position_. The only relevance of the ending position is that it is the start of the next match which begins between the end of the last match and the next character. This happens to be the physical location of `$` in this case, and cannot be consumed. So, the engine sets the new location as described, sees that it can ignore `.*` and goes ahead and matches `$`. It's really as simple as that. –  Sep 29 '18 at 11:54