14

I have a NUL delimited output coming from the following command :

some commands | grep -i -c -w -Z 'some regex'

The output consists of records of the format :

[file name]\0[pattern count]\0

I want to use text manipulation tools, such as sed/awk, to change the records to the following format :

[file name]:[pattern count]\0

But it seems that sed/awk usually handle only records delimited by the "newline" character. I would like to know how sed/awk could be used to achieve this or, if sed/awk cannot handle such a case, what other Linux tool I should use.

Thanks for any suggestion.

Lawrence

user1129812
  • so how do you look at this file? with a hex editor? How does it know where to 'break' the lines? Why not just convert the '\0' to '\n' and have a nice easy to read file that can be processed using the standard unix paradigm? Otherwise at every step, you'll be fighting the basic law of unix, "each record on its own line" ! ;-) Life is too short, There are much more interesting problems to do battle with. Can you get the original source of output to use '\n' or ... shudder, '\r\n' ? Good luck. – shellter Feb 07 '12 at 03:17
  • The output is not to be displayed, it is piped into another command. I use NUL as separator as Linux file names could have "newline" character in it. I agree that life is only too short for us to figure out all the solutions for our questions. – user1129812 Feb 07 '12 at 03:50
  • but a filename is a different piece of 'data' than the data included in a pipe. The two only meet as and when data is written into a file with a name that may have a '\n' in it. Good luck. – shellter Feb 07 '12 at 04:07
  • I finally figured out that `grep -c -Z` places a NUL character only after `[file name]` and a "newline" character after `[pattern count]`. I have now chosen not to use the `grep -Z` option, but TejasP's answer is still helpful for parsing NUL-delimited files with awk in the future. Thanks all. – user1129812 Feb 07 '12 at 06:03

4 Answers

8

Since version 4.2.2, GNU sed has the -z or --null-data option to do exactly this. For example:

sed -z 's/old/new/' null_separated_infile
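
Applied to the format in the question, one possible sketch (not the only way to do it) is to pull two records at a time into the pattern space and replace the NUL between them. The helper name `join_pairs` and the sample file names are made up for illustration, and the script assumes GNU sed 4.2.2 or later, where `N` under `-z` joins records with a NUL.

```shell
# Hypothetical helper: turn [name]\0[count]\0 pairs into [name]:[count]\0.
# Assumes GNU sed with -z support.
join_pairs() {
  # N appends the next NUL-terminated record to the pattern space,
  # separated by a NUL; s/// then replaces that first NUL with ":".
  sed -z 'N; s/\x00/:/'
}

# Sample data standing in for the real pipeline.
# tr is only here to make the NUL separators visible on a terminal.
printf '%s\0' 'a.txt' 3 'b c.txt' 12 | join_pairs | tr '\0' '\n'
```

The output stays NUL-terminated, so it can be piped straight into another NUL-aware command once the trailing `tr` is dropped.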
Graeme
6

By default, the record separator is the newline character, defining a record to be a single line of text. You can use a different character by changing the built-in variable RS. The value of RS is a string that says how to separate records; the default value is \n, the string containing just a newline character.

 awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
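
If you do not have the BBS-list file at hand, the effect of changing RS can be seen with inline data. This is only a small sketch; the function name `split_on_slash` and the sample string are invented for the illustration.

```shell
# Hypothetical helper: treat "/" as the record separator instead of newline.
split_on_slash() {
  awk 'BEGIN { RS = "/" } { print $0 }'
}

# Each "/"-separated piece becomes its own record and is printed on its own line.
printf 'alpha/beta/gamma' | split_on_slash
```

A single-character RS such as "/" works in any POSIX awk, unlike RS = "\0", which depends on the implementation.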
Yuri
Tejas Patil
  • I have tested that the command `awk 'BEGIN { RS = "\0" } ; { print $0 }'` could delimit records with the NUL character. But [The GNU Awk User's Guide](http://www.gnu.org/software/gawk/manual/html_node/Records.html) says that **RS = "\0" Is Not Portable**. Anyway, I could start with this command to try to change the NUL character before the [pattern count] to the ":" character in my case. – user1129812 Feb 07 '12 at 03:21
3

Yes, gawk can do this: set the record separator to \0. For example, the command

gawk 'BEGIN { RS="\0"; FS="=" } $1=="LD_PRELOAD" { print $2 }' </proc/$(pidof mysqld)/environ

will print out the value of the LD_PRELOAD variable:

/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

The /proc/$PID/environ file is a NUL-separated list of environment variables. I'm using it as an example because it's easy to try on a Linux system.

The BEGIN part sets the record separator to \0 and the field separator to = because I also want to extract the part after = based on the part before =.

The $1=="LD_PRELOAD" runs the block if the first field has the key I'm interested in.

The print $2 block prints out the string after =.
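
The same RS="\0" idea can be sketched for the record format in the question. This is an illustration under assumptions rather than a tested recipe for the real pipeline: the function name `join_pairs_gawk` and the sample file names are made up, and it needs gawk specifically.

```shell
# Hypothetical helper: read NUL-separated records and emit [name]:[count]\0.
# Assumes gawk; RS = "\0" is not portable to other awks.
join_pairs_gawk() {
  gawk 'BEGIN { RS = "\0" }
        # Odd-numbered records hold the file name: remember it and move on.
        NR % 2 { name = $0; next }
        # Even-numbered records hold the count: print the joined pair.
        { printf "%s:%s\0", name, $0 }'
}

# Sample data standing in for the real pipeline; tr only makes the NULs visible.
printf '%s\0' 'a.txt' 3 'b c.txt' 12 | join_pairs_gawk | tr '\0' '\n'
```

Because the whole record is taken from $0, a file name containing spaces or newlines is passed through unchanged.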


But mawk cannot parse input files separated with NUL. This is documented in man mawk:

BUGS
       mawk cannot handle ascii NUL \0 in the source or data files.

mawk will stop reading the input after the first \0 character.


You can also use xargs to handle NUL separated input, a bit non-intuitively, like this:

xargs -0 -n1 </proc/$$/environ

xargs uses echo as its default command. -0 tells it that the input is NUL-separated. -n1 limits echo to one argument per invocation, so the output is separated by newlines.
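
To try this without /proc, here is a small sketch with made-up input; the function name `nul_to_lines` is only for illustration.

```shell
# Hypothetical helper: print each NUL-terminated item on its own line.
nul_to_lines() {
  # -0: items are NUL-separated; -n1: run the default command (echo)
  # once per item, so each item ends up on a separate line.
  xargs -0 -n1
}

# An item containing a space stays intact as a single argument.
printf '%s\0' 'first item' 'second' | nul_to_lines
```

Since the items go through echo, an item that looks like an echo option (for example -n) may not be printed literally.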


And as Graeme's answer shows, sed can do this too.

Paul Tobias
1

Using sed to replace the NUL characters with spaces:

sed 's/\x0/ /g' infile > outfile

Or substitute in place (this saves a backup of the original file as infile.bak and overwrites the original with the substitutions):

sed -i.bak 's/\x0/ /g' infile

Using tr:

tr -d "\000" < infile > outfile
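
A quick way to compare deleting the NULs with translating them into newlines, using made-up inline data instead of a file (the function names are hypothetical):

```shell
# Hypothetical helpers for the two tr variants.
strip_nul()      { tr -d '\000'; }      # delete every NUL
nul_to_newline() { tr '\000' '\n'; }    # turn every NUL into a newline

# Deleting glues the records together.
printf '%s\0' a b | strip_nul
echo    # add a newline, since the stripped output has none

# Translating keeps the records apart, one per line.
printf '%s\0' a b | nul_to_newline
```

Neither variant can tell the NUL after the file name from the NUL after the count, so on its own tr cannot produce the [file name]:[pattern count] format.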
jaypal singh
  • or `tr "\000" "\n" < infile > output` :-?) – shellter Feb 07 '12 at 03:23
  • @shellter You are right. I was not sure if OP wanted to substitute them with newlines or remove them … :) – jaypal singh Feb 07 '12 at 03:37
  • But my purpose is to only replace the NUL character before the [pattern count], not to replace all NUL characters. – user1129812 Feb 07 '12 at 03:43
  • @user1129812 In that case you can use the `sed` command and remove the `g` flag from it. The `g` flag makes the substitution global. Without it, only the first occurrence on each line is changed. – jaypal singh Feb 07 '12 at 03:58