sed substitution using regex * quantifier

Question

I'm working through regular expression examples at the Linux command line. Specifically, I'm looking at the regex '*' quantifier, which means 'zero or more occurrences of the preceding element'. It's clear in the trivial example below why 'rrr' is replaced with 'x':

[..~]$ echo rrr | sed -re 's/r*/x/g'
x

It's not clear to me what is going on in the following two examples:

[..~]$ echo f | sed -re 's/r*/x/g'
xfx

[..~]$ echo fd | sed -re 's/r*/x/g'
xfxdx

Does sed encounter the 'f' as the first element in the text stream, determine that there are zero occurrences of 'r', and pass the 'x' to stdout followed by 'f'? If so, why is there a trailing 'x'?

Possible duplicate of [Regex plus vs star difference?](https://stackoverflow.com/questions/8575281/regex-plus-vs-star-difference) — wp78de, Mar 27 '18 at 06:31

score 0 · Answer 1 · answered Mar 25 '18 at 22:17

echo f | sed -re 's/r*/x/g'
xfx

The asterisk matches an arbitrary number of repetitions of the preceding expression: 0, 1 or 723 times. There is no r in front of the f, and there are 0 rs after the f. Zero repetitions is a valid match for r*, so each of those empty positions is replaced with x.

For at least one r, you would use the +:

echo f | sed -re 's/r+/x/g'
f

In principle, the same applies to fd, except that there is one more position, between f and d, where zero rs match:

echo fd | sed -re 's/r*/x/g'
xfxdx
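
To check the difference between the two quantifiers side by side, here is a small sketch that runs both substitutions over the three inputs from the question. The helper name `subst` is made up for this illustration, and `-E` is used as the portable spelling of GNU sed's `-r`.

```shell
# Hypothetical helper: apply s/PATTERN/x/g to a single line of input.
# $1 is the extended regex, $2 is the input text.
subst() {
    printf '%s\n' "$2" | sed -E "s/$1/x/g"
}

# Compare "zero or more" (r*) with "one or more" (r+) on each input.
for input in rrr f fd; do
    printf '%s: r* gives %s, r+ gives %s\n' \
        "$input" "$(subst 'r*' "$input")" "$(subst 'r+' "$input")"
done
```

With r+ the inputs f and fd come back unchanged, because there is no r to match; with r* every position where zero rs fit is replaced.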

score 0 · Answer 2 · answered Mar 25 '18 at 22:23

When you ask for "zero or more" of something then EVERYWHERE where zero of that thing can possibly be placed will match and be substituted - for example there are zero r in the space between every single character in the string and also at the beginning and end of the string.

So really you don't mean "zero or more" - you mean "one or more" because you are only expecting it to match a sequence of r if there were some r in the first place. "Zero or more" really does mean anywhere you could have an r but don't.
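
One way to make those empty matches visible is to wrap each match in brackets instead of replacing it, using `&` (the matched text) in the replacement. This is only a sketch: the helper name `mark` is invented here, and `-E` stands in for GNU sed's `-r`.

```shell
# Hypothetical helper: bracket every match of r* so that
# zero-length matches show up as "[]".
mark() {
    printf '%s\n' "$1" | sed -E 's/r*/[&]/g'
}

mark rrr   # [rrr]      one match that consumes all three r's
mark f     # []f[]      empty matches before and after f
mark fd    # []f[]d[]   empty matches at every position
```

Each `[]` marks a spot where the regex matched zero characters, which is exactly where the x appears in the original substitution.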

Answered by Jerry Jeremiah

OK, I see that the way to approach this is to consider that the logical consequence of 'zero or more' takes you into the spaces between the elements of the stream (i.e. everywhere). It makes sense. It just seems odd, when you approach this from the POV of processing a sequential text stream, that the spaces between elements are significant. I realize that if I had written a script to process a stream per the RE, it would turn out to do what 'one or more' does :). There's a lot more going on under the hood than I initially thought. Thank you both. – hugo Mar 26 '18 at 01:01
