1

I have a tsv, file1, that is structured as follows:

col1  col2  col3
1     4     3
22    0     8
3           5

so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?

orange1
  • 2,871
  • 3
  • 32
  • 58

2 Answers2

2

A straightforward approach is

sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file
  • sed -i edit file in place;

  • s/a/b/ replace a with b;

  • s/^\t/\tNA/ replace \t in the beginning of the line with NA\t (the first column becomes NA);

  • s/\t$/\tNA/ the same for the last column;

  • s/\t\t/\tNA\t/ insert NA in between \t\t;

  • :0 s///; t0 repeat s/// if there was a replacement (in case there are other missing values in the line).

Andrey
  • 2,503
  • 3
  • 30
  • 39
  • An explanation could be useful for those unfamiliar with `sed`. – nneonneo May 07 '15 at 19:11
  • An explanation of what this does / how this works would make this a useful answer. – Stephen P May 07 '15 at 19:12
  • I'm getting the following error with this: `sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file1` sed: 1: "file1": invalid command code f – orange1 May 07 '15 at 19:26
  • Try this: `echo -e '1\t2\t3\t4\n\t\t\t' | sed 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0'` . If it works, the issue is with the file or with `-i` option (see [http://stackoverflow.com/questions/7573368/in-place-edits-with-sed-on-os-x here]). – Andrey May 07 '15 at 19:34
2

awk is designed for this scenario (among a million others ;-) )

  awk -F"\t" -v OFS="\t" '{
       for (i=1;i<=NF;i++) {
         if ($i == "") $i="NA"
       }
       print $0
 }' file > file.new && mv file.new file

-F="\t" indicates that the field separator (also known as FS internally to awk) is the tab character. We also set the output field separator (OFS) to "\t".

NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.

We test if the $i th element is empty with if ($i == "") and when it is, we change the $i th element to contain the string "NA".

For each line of input, we print the line's ($0) value.

Outside the awk script, we write the output to a temp file, i.e. file > file.new. The && tests that the awk script exited without errors, and if OK, then moves the file.new over the original file. Depending on the safety and security use-case your project requires, you may not want to "destroy" your original file.

IHTH.

shellter
  • 36,525
  • 7
  • 83
  • 90
  • @orange1: added explanation. Are you using the `orange` data-science environment by any chance? Good luck. – shellter May 07 '15 at 19:26
  • Thanks for the additional explanation. I am not, actually. Gave it a quick google too, and couldn't find it. – orange1 May 08 '15 at 15:53