Using GNU sed (with the -r flag for clarity), the following two substitutions on the input string ab give the same result, ba:
s/(.)(.)|(.)(.)$/\2\1\3\4/
and
s/(.)(.)$|(.)(.)/\1\2\4\3/
It would appear that the alternative (.)(.) (the one without $) succeeds in both substitutions, regardless of whether it is the first or second alternative. Why is this the case? What is the tie-breaker for such alternatives?
The POSIX specification of regular expressions specifies [1] the tie-breaker when the alternatives start at different positions (the earlier one is favoured) and when they start at the same position but have different lengths (the longer one is favoured). However, it does not appear to specify the behaviour of capturing groups when two alternatives start at the same position and have the same length, thus leaving it to the specific implementation.
[1] "The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. [...]" – The Open Group Base Specifications Issue 7, 2018 edition
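For contrast, here is a minimal sketch of the two cases that the quoted passage does cover, assuming GNU sed with -r as above. The inputs xab and abc and the variable names are only illustrative and are not part of the original problem.

```shell
# Rule 1: the match that begins earliest in the string wins,
# even though the alternative "b" is listed first in the pattern.
earliest=$(echo xab | sed -r 's/b|a/X/')
echo "$earliest"   # xXb

# Rule 2: when several matches begin at the same position,
# the longest one wins, even though "a" is listed first.
longest=$(echo abc | sed -r 's/a|ab/X/')
echo "$longest"    # Xc
```

In both cases the order of the alternatives in the pattern plays no role, which is why the equal-start, equal-length case is the interesting one.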
Here is a running example of the phenomenon.
echo ab|sed -r 's/(.)(.)|(.)(.)$/\2\1\3\4/'
echo ab|sed -r 's/(.)(.)$|(.)(.)/\1\2\4\3/'
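To see directly which pair of groups captured, a variant of the commands above can bracket each pair in the replacement instead of swapping characters. This is only a sketch: show_groups is a hypothetical helper name, and it relies on sed substituting an empty string for a group that did not take part in the match. Given the results reported above, I would expect the first call to fill the first bracket pair and the second call to fill the second, but that is exactly the behaviour in question.

```shell
# Hypothetical helper: apply the pattern given as $1 to the input "ab"
# and print groups 1-2 and groups 3-4 in separate brackets.
show_groups() {
  echo ab | sed -r "s/$1/[\\1\\2][\\3\\4]/"
}

# Alternative without $ listed first, then listed second.
first=$(show_groups '(.)(.)|(.)(.)$')
second=$(show_groups '(.)(.)$|(.)(.)')

echo "first:  $first"
echo "second: $second"
```

A non-empty bracket marks the alternative whose groups were reported by the implementation.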