When a substitution's REGEXP includes something like ^|., sed doesn't match the null string at the beginning of the pattern space if the first character also matches. Likewise, it doesn't match the null string at the end if the last character matches. Why is that?

Here are some examples using 123 as input (with the -r option):

substitution    expected output     actual output   comments
s/^/x/g         x123                x123            works as expected
s/$/x/g         123x                123x            works as expected
s/^|$/x/g       x123x               x123x           works as expected
s/^|./x/g       xxxx                xxx             didn't match the very beginning
s/.|$/x/g       xxxx                xxx             didn't match the very end
s/^|1/x/g       xx23                x23             didn't match the very beginning
s/^|2/x/g       x1x3                x1x3            this time it did match the beginning

I get the same results when using \` instead of ^.
I've tried GNU sed version 4.2.1 and 4.2.2

Riley

1 Answer

AFAIK sed will pick the longest match in an alternation.

So when both the null string at the beginning of the pattern space and 1 can be matched at the same position, 1 is chosen because it's the longer match.

Consider the following:

$ sed 's/12\|123/x/g' <<< 123
x
$ sed 's/123\|12/x/g' <<< 123
x
$ sed 's/^1\|12/x/g' <<< 123
x3
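
The same check can be applied to the pattern from the question. The following is a small sketch, assuming GNU sed (where -E is equivalent to the -r option used in the question); the variable names are only for illustration. It runs the alternation in both orders to see whether the order of the alternatives matters:

```shell
# Assumes GNU sed; -E is equivalent to the -r option used in the question.
first=$(printf '123\n' | sed -E 's/^|1/x/g')    # empty alternative listed first
second=$(printf '123\n' | sed -E 's/1|^/x/g')   # empty alternative listed last
printf '%s\n%s\n' "$first" "$second"
```

If sed really prefers the longest alternative at a given position, both commands should print x23, regardless of which alternative comes first.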

The same applies when reaching the end. Let's break down sed 's/.\|$/x/g' <<< 123:

123
^
. matches and is replaced with x
x23
 ^
 . matches and is replaced with x
xx3
  ^
  . matches and is replaced with x
xxx
   ^
   Out of pattern space; $ will not match.
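
One way to probe that last step is to compare it with a pattern that leaves the final character alone. This is a sketch assuming GNU sed with -E (equivalent to -r); the variable names are hypothetical. The working assumption, not confirmed by the sed documentation here, is that an empty match is skipped when it sits directly after the previous match:

```shell
# Assumes GNU sed; -E is equivalent to the -r option used in the question.
# Here "3" is consumed by "." right before the end of the pattern space.
all=$(printf '123\n' | sed -E 's/.|$/x/g')
# Here "3" is not matched, so the end is not adjacent to a previous match.
some=$(printf '123\n' | sed -E 's/[12]|$/x/g')
printf '%s\n%s\n' "$all" "$some"
```

Under that assumption the first command prints xxx and the second prints xx3x, which mirrors the s/^|2/x/g row in the question's table.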
Andreas Louv
  • So it doesn't treat the null string at the beginning as being in its own position? It's part of the first character, in a way? – Riley Oct 01 '16 at 16:51
  • `^` matches said null string, and the length of that match is 0, while `1` at the first position has a length of 1, so that is what gets replaced. Both will be matched, but only the longest will be replaced. – Andreas Louv Oct 01 '16 at 17:06
  • Why aren't they both replaced? `sed 's/12\|123/x/g' <<< 12123` replaces `12` and `123` even though `123` is longer. – Riley Oct 01 '16 at 17:13
  • Your example is different, as `12` is matched and replaced, then `123`. Consider this: `sed 's/11\|111//g' <<< 1111`. How many ones are left after the replacement, zero or one? – Andreas Louv Oct 01 '16 at 17:26
  • Why is `s/^|1/x/g <<< 123` different from `s/1|2/x/g <<< 123`? The second replaces 1 then 2, so shouldn't the first replace the null string and then the 1? – Riley Oct 01 '16 at 17:31
  • In the first substitution both matches start at the same cursor position, and therefore only one of the two will be replaced. – Andreas Louv Oct 01 '16 at 17:34
  • I guess that makes sense, but I would have done it differently. Thanks. – Riley Oct 01 '16 at 17:36
  • I guess it's unexpected behavior. Perl will do what you expect: `perl -pe 's/^|./x/g' <<< 123` will output `xxxx`. Vim, meanwhile, will take the first match (Perl does the same, so `s/.|^/x/g` will result in `xxx`): `vim - -u NONE +':s/^\|./x/g' <<< '123'` will have `x1xx` in the buffer. – Andreas Louv Oct 01 '16 at 17:42
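
The two cases raised in the comments above can be run side by side. This is a minimal sketch, assuming GNU sed with -E (so `|` replaces the `\|` used in the comments); the variable names are illustrative only:

```shell
# Assumes GNU sed; -E lets us write | instead of \| for alternation.
# At position 0 both 11 and 111 can match; the question is which one is taken.
left=$(printf '1111\n' | sed -E 's/11|111//g')
# Here 12 and 123 match at different starting positions, so both are replaced.
both=$(printf '12123\n' | sed -E 's/12|123/x/g')
printf '%s\n%s\n' "$left" "$both"
```

If the longest alternative wins at each position, the first command leaves a single 1 (111 is removed, not 11 twice), and the second prints xx.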