Tiebreaker for same-length regex alternatives with same starting position

Question

Using GNU sed (with the -r flag for clarity), the following two substitutions on the input string ab give the same result:

s/(.)(.)|(.)(.)$/\2\1\3\4/

and

s/(.)(.)$|(.)(.)/\1\2\4\3/

both give ba. It would appear that the alternative (.)(.) (the one without $) succeeds in both substitutions, regardless of whether its position as the first or second alternative. Why is this the case? What is the tie-breaker for such alternatives?

The POSIX specification of regular expressions specifies¹ the tiebreaker for when the alternatives start at different positions (in which case the earlier one is favoured), and when they start at the same position but have different lengths (the longer one is favoured), but it does not appear to specify the behaviour of capturing groups when two alternatives start at the same position and have the same length, thus leaving it to the specific implementation.

The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. [...] – The Open Group Base Specifications Issue 7, 2018 edition

Here is a running example of the phenomenon.

echo ab|sed -r 's/(.)(.)|(.)(.)$/\2\1\3\4/'
echo ab|sed -r 's/(.)(.)$|(.)(.)/\1\2\4\3/'

Try it online!

interesting, wonder what's the result with other sed implementations.. if all of them give same results, may be there's some logic applied.. — Sundeep, Jan 10 '20 at 14:57
As the word _tie_ does not appear in the linked page, I guess your long last paragraph is your understanding of a section of the linked page. I think it'd be good if you could quote the specific text you are referring to. +1 to this very interesting question! — Enlico, Jan 12 '20 at 20:19

Enlico · Answer 1 · 2020-02-01T16:59:44.183

Hypothesizing an answer.

Given the input line ab, both (.)(.) and (.)(.)$ will match ab with the same length of 2. So, as you say in your question, the two regexps match with the same length from the same starting point.

However, I would say that, at the moment (.)(.) matches against ab, the engine would have one more check to do (against $) in order to check if (.)(.)$ matches too (i.e. if it's matching at the EOL), in which case the latter regexp would not be preferred anyway, because it has the same length and started at the same point as the former regex which was already matched. So it makes sense to me that the engine returns the references to the groups in (.)(.) without $.

I think this reasoning of mine implies that the matching is greedy against printable characters, but lazy against non-printable ones.

Compare/contrast with this:

echo ab|sed -r 's/^(.)(.)|(.)(.)/\2\1\3\4/'
ba

echo ab|sed -r 's/(.)(.)|^(.)(.)/\2\1\3\4/'
ba

where the leftmost regex matches in both cases.

Interesting find with the `^` example – user41805 Feb 11 '20 at 12:53 — user41805, Feb 11 '20 at 12:53

Tiebreaker for same-length regex alternatives with same starting position

1 Answers1

Linked