Replacing null bytes with `sed` vs `tr`

Question

Bash newbie; using this idiom to generate repeats of a string:

echo $(head -c $numrepeats /dev/zero | tr '\0' 'S')

I decided I wanted to replace each null byte with more than one character (e.g. 'MyString' instead of just 'S'), so I tried the following with sed:

echo $(head -c $numrepeats /dev/zero | sed 's/\0/MyString/g' )

But I just get an empty output. I realized I have to do

echo $(head -c $numrepeats /dev/zero | sed 's/\x0/MyString/g' )

or

echo $(head -c $numrepeats /dev/zero | sed 's/\x00/MyString/g' )

instead, but I don't understand why. What is the difference between the characters that tr and sed match? Is it because sed is matching against a regex?

Edit: Interesting discovery: \0 in the replacement portion of the 's/regexp/replacement/' sed command actually behaves the same as &. That still doesn't explain why \0 in regexp doesn't match the null byte, though (as it does in tr and most other regex implementations).

first glance: just drop your echo and interpolation. they are redundant. — Jason Hu, Mar 04 '17 at 05:45
secondly, `sed` and `tr` are 2 different tools, there is no obligatory contract between both. — Jason Hu, Mar 04 '17 at 05:46
I know you're not asking this, but here's [a good way to repeat any string in Bash](http://stackoverflow.com/questions/5349718/how-can-i-repeat-a-character-in-bash). — Benjamin W., Mar 04 '17 at 06:29
@BenjaminW. yep, saw that before, but the problem is Bash does brace expansion before variable expansion, which means I can't specify a range to `printf` with a variable...this seemed to be the next best idiom I could find. — , Mar 05 '17 at 02:35
You could use [`seq`](https://www.gnu.org/software/coreutils/manual/html_node/seq-invocation.html#seq-invocation) instead, if you don't mind relying on an external (and non-POSIX) tool. — Benjamin W., Mar 05 '17 at 03:32
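The `seq` suggestion can be sketched as below. This is only an illustration: `repeat_string` is a hypothetical helper name, and the guard is there because some `seq` implementations count downwards when the end value is smaller than the start.

```shell
# repeat_string STRING COUNT: print STRING COUNT times, no trailing newline.
# Hypothetical helper name; relies on the external (non-POSIX) seq utility.
repeat_string() {
    # Print nothing for counts below 1 instead of trusting seq's behaviour
    # for a descending range.
    [ "$2" -ge 1 ] || return 0
    # The loop variable is unused; seq only supplies the iteration count,
    # so a variable works here where a {1..N} brace range would not.
    for i in $(seq 1 "$2"); do
        printf '%s' "$1"
    done
}

numrepeats=3
repeat_string 'MyString' "$numrepeats"   # MyStringMyStringMyString
echo
```

Because the string is passed to `printf '%s'` as data rather than used as the format, it may contain `%` or backslashes without being reinterpreted.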

linuxfan says Reinstate Monica · Accepted Answer · 2017-03-04T06:45:04.730

8

From the manual page of tr(1):

SETs are specified as strings of characters ... Interpreted sequences are:
\NNN character with octal value NNN (1 to 3 octal digits)
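A small sketch of what that excerpt means in practice: `tr` reads `\0` as a one-digit octal escape, so it names the NUL byte directly. The helper name `nul_to_char` is made up for the illustration.

```shell
# nul_to_char CHAR: read stdin and replace every NUL byte with CHAR.
# Hypothetical helper; tr interprets '\0' as the octal escape for byte 0.
nul_to_char() {
    tr '\0' "$1"
}

printf 'a\0b\0c\n' | nul_to_char S    # aSbSc
printf 'a\0b\0c\n' | tr '\000' 'S'    # aSbSc (three-digit octal spelling)
```

Since `tr` maps single characters to single characters, this form cannot expand one NUL byte into a longer string, which is what pushed the question towards `sed`.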

For sed(1) the manual page is less clear, so a few experiments help:

echo -n hi |sed 's/h/t/g' |hexdump -c    (0000000   t   i)

Easy. Then:

echo -n hi |sed 's/h//g' |hexdump -c      (0000000   i)

An empty replacement deletes the match. Again easy. Then:

echo -n hi |sed 's/h/\0/g' |hexdump -c    (0000000   h   i)

This \0 seems to do nothing. So try

echo -n hi |sed 's/h/\00/g' |hexdump -c   (0000000   h   0   i)

Oh! Could it take \0 as a reference to the matched part? That would also explain the previous example. The sed man page mentions \1 to \9, not \0 (but \0 has a meaning anyway, even in the pattern specification).
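One way to check that guess is to wrap the reference in brackets and compare it with `&`. This sketch assumes GNU sed, the implementation used in these experiments; other sed implementations may treat `\0` in the replacement differently.

```shell
# Assumes GNU sed. Both commands bracket the matched text, which suggests
# that \0 in the replacement refers to the whole match, like &.
printf 'hi\n' | sed 's/h/[&]/'     # [h]i
printf 'hi\n' | sed 's/h/[\0]/'    # [h]i
```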

So, to cut it short: for sed, \0 has a special meaning which is not a NUL char. But it understands octal:

echo -n hi |sed 's/h/\o0/g' |hexdump -c    (0000000  \0   i)

and hexadecimal:

echo -n hi |sed 's/h/\x0/g' |hexdump -c    (0000000  \0   i)

As pointed out in the comments, tr and sed are different tools, designed differently. Yes, sed uses regexps while tr does not, but that is not the general explanation of why \0 is interpreted differently. In the messy world of Unix there are, often, some conventions. In the messy world of Unix there are, more often, exceptions to those conventions.
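Applied to the original task, the findings above might look like the following sketch. It assumes GNU sed, since the `\x`, `\o` and `\d` escapes are GNU extensions. `nul_to_string` is a hypothetical helper, and it only handles replacement strings without `/`, `&` or backslashes.

```shell
# nul_to_string STRING: read stdin and replace each NUL byte with STRING.
# Hypothetical helper; assumes GNU sed. STRING is pasted into the sed
# script unescaped, so it must not contain '/', '&' or backslashes.
nul_to_string() {
    # Inside double quotes the shell leaves \x00 untouched, so sed sees it.
    sed "s/\x00/$1/g"
}

numrepeats=3
head -c "$numrepeats" /dev/zero | nul_to_string 'MyString'   # MyStringMyStringMyString
echo

# The octal and decimal spellings reported in this thread:
head -c "$numrepeats" /dev/zero | sed 's/\o0/MyString/g'; echo
head -c "$numrepeats" /dev/zero | sed 's/\d0/MyString/g'; echo
```

The extra `echo` calls only add a final newline, because the input from `/dev/zero` has none.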

edited Mar 04 '17 at 06:45

answered Mar 04 '17 at 06:37

linuxfan says Reinstate Monica


It makes sense that, in `replacement`, `\0` refers, like `&`, to the matched portion of `regexp` (although, as you say, this is not explicitly stated in the manpages), i.e. it has a special meaning which is not the null byte _but only in the `replacement` portion_. This still doesn't explain why `\0` in the `regexp` part doesn't match the null byte (as it does in `tr` and in most other regex implementations). – Mar 05 '17 at 03:05
Also pointing out that `tr` and `sed` have no obligation to behave the same doesn't answer the question either (yes, that much is obvious, I don't dispute the fact, I was just wondering if anyone knew of a coherent explanation or whether it's "just another Bash quirk") – Mar 05 '17 at 03:05
Nonetheless, thanks for uncovering the `\0` as `&` behaviour in the `replacement` portion, TIL – Mar 05 '17 at 03:06
@user141554 There is no bash quirk here (I verified): using single quotes makes all arguments plain and literal. `tr` uses `\xxx` for octal notation (and lacks decimal and hex), while `sed` uses `\x` to indicate a different thing - _not characters_. But it does have octal, decimal and hex. I find that coherent enough, apart from the manual pages, which are often cryptic and imprecise. I think your question has got an answer about the _replacement_ part of sed's substitute command. The `\x` plays a role even in the _matching_ part of sed's substitute command, but that's another story. – linuxfan says Reinstate Monica Mar 05 '17 at 06:59
Hmm, I think if `\0` in the `regexp` portion were similar to the special characters `\1` to `\9`, then we should get something like `sed: -e expression #1, char 9: Invalid back reference` if we tried to do something like `ls | sed 's/\0//g'` for instance. Because that is what we get for `ls | sed 's/\1//g'`. But for `\0` it just doesn't match and replace anything – Mar 05 '17 at 07:44
@user141554 I didn't find any docs about \0 in pattern; from a few tests, it seems that it is simply ignored. It makes some sense... if \1 .. \9 have a meaning, \0 has the same syntax but is not assigned so it is simply ignored/skipped. Perhaps a peek at the sources is the only way to clarify this. – linuxfan says Reinstate Monica Mar 06 '17 at 07:59

@user141554 I took a peek at sed sources (there are quite a few versions, I looked only at one). While examining regexp for substitute command, there is explicit reference to \1 .. \9, in a switch(), but never to \0. I didn't go deeper ... I think it is simply ignored. – linuxfan says Reinstate Monica Mar 08 '17 at 08:57
Alright, I guess the question has more or less evolved into a "what does that \0 do in the regexp part of sed", and my curiosity has been pretty much satisfied. So thanks! – Mar 10 '17 at 02:51

Scheff's Cat · Answer 2 · 2017-03-04T06:44:56.323

2

The latter two commands in the question do work:

$ sed --version
sed (GNU sed) 4.4
Packaged by Cygwin (4.4-1)

$ echo -e "Hello\0World" | hexdump.exe -c
0000000   H   e   l   l   o  \0   W   o   r   l   d  \n                
000000c

$ echo -e "Hello\0World" | sed 's/\x0/MyString/g'
HelloMyStringWorld

$ echo -e "Hello\0World" | sed 's/\x00/MyString/g'
HelloMyStringWorld

Octal sequences have to be prefixed by \o (thanks, Benjamin W., for this hint):

$ echo -e "Hello\0World" | sed 's/\o0/MyString/g'
HelloMyStringWorld

Thus, there must be another issue in the OP.
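The byte-level check above can also be written without `echo -e` or `hexdump`, which behave differently between systems. This sketch uses `printf` and `od -c` instead; `show_bytes` is a hypothetical wrapper name.

```shell
# show_bytes: dump stdin one character per column, using POSIX od.
# Hypothetical wrapper name, used only for this illustration.
show_bytes() {
    od -c
}

# printf interprets \0 in its format string as a NUL byte, so the test
# input does not depend on how a given shell's echo handles -e.
printf 'Hello\0World\n' | show_bytes
```

In the dump, the NUL byte appears as `\0` between the `o` and the `W`.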

edited Mar 04 '17 at 06:44

answered Mar 04 '17 at 05:55

Scheff's Cat



I guess it depends on the sed version. My GNU sed 4.3 doesn't understand `\0`, only `\o0`, `\d0` or `\x0`. – Benjamin W. Mar 04 '17 at 06:38
This answer just shows precisely my point: you have to specify the null byte in `sed` (with `\x0` or `\x00` or `\o0` or whatever) differently than you do with `tr`. My question is why this is the case, if there is a proper answer to that at all. Or it may just be one of those Bash quirks which always catch beginners :/ – Mar 05 '17 at 02:42
@user141554 I guess it's actually not a "Bash quirk". The reg. expr. of `sed` is "secured" by single quotes. Thus, the bash may not be blamed for this. But I agree with you: I consider this as one of the less lucky design decisions although there might exist reasons for this. - There are a handful of really useful "standard" tools for text processing in Unix-likes. Each of them seems to have its own flavor of reg. expr. language. (Esp. when to use backslash for meta and when not drives me crazy...) – Scheff's Cat Mar 05 '17 at 08:00

score 1 · Answer 3 · answered Mar 05 '17 at 04:58

Specious question: there is no tr and sed per se. Rather, there are versions of these programs across time and OS platforms. Generally speaking, UNIX's history is a rapid florescence of variation; more specifically, tr was released for Version 4 Unix in 1973, while sed first appeared in Version 7 Unix in 1979. From the get-go, these were written by different authors, on different OSes, for different shells, with different purposes (note: Bash was written much later, in 1989, and is NOT the "owner" of either of these utilities). And things only get more varied and complex in terms of how these programs independently evolved, how they were maintained (again by different authors), how and which bugs were fixed, etc. While much effort has been made of late to standardize core utilities, assuming that sed and tr would treat characters in exactly the same way is failing to grok the history, the troublesome lack of standards, as well as the strangely beneficial plurality of UNIX itself.
