1

I wanted to ask a question following up on the one I posted before:

awk compare columns from two files, impute values of another column

I am trying to figure out how I can print NA when I have several unmatched values.

File1

rs1    AA    10
rs2    BB    20
rs3    CC    30
rs4    DD    40


File2

rs1    QQ   TT   UU
rs3    RR   WW   
rs4    ZZ      

Desired output

rs1    AA    10    QQ   TT   UU
rs2    DD    20    NA   NA   NA       
rs3    EE    30    RR   WW   NA
rs4    RR    40    ZZ   NA   NA

This code prints NA only when the entire $0 is missing:

awk 'FNR==NR{a[$1]=$0;next}{print $0,a[$1]?a[$1]:"NA"}' file2 file1

Current output:

rs1    AA    10    QQ    TT    UU
rs2    DD    20    NA
rs3    EE    30    RR    WW
rs4    RR    40    ZZ     
Community
  • 1
  • 1
user2162153
  • 267
  • 1
  • 4
  • 11

2 Answers2

1

Some like this?

awk 'FNR==NR{for (i=2;i<=NF;i++) a[i,$1]=$i;next}{printf "%s\t",$0; for (i=2;i<=6;i++) printf "%s\t",(a[i,$1]?a[i,$1]:"NA");print ""}' f2 f1
rs1    AA    10 QQ      TT      UU      NA      NA
rs2    BB    20 NA      NA      NA      NA      NA
rs3    CC    30 RR      WW      NA      NA      NA
rs4    DD    40 ZZ      NA      NA      NA      NA

Since you have a large file, you need to set the loop to the number of columns you like

Jotne
  • 40,548
  • 12
  • 51
  • 55
1

Try this:

awk '
  BEGIN {OFS = "\t"}
  FNR == NR {
    if (NF > 1) {
      if (NF > maxnf) maxnf = NF
      nf[$1] = NF
      a[$1] = $2
      for (i = 3; i <= NF; ++i) a[$1] = a[$1] "\t" $i
    }
    next
  }
  {
    if (NF < 3) {$3 = $2; $2 = " "}
    else         $1 = $1  # ensure fields are separated by tabs
    printf($0)
    n = 1
    if ($1 in a) {n = nf[$1]; printf("\t%s", a[$1])}
    for (i = n; i < maxnf; ++i) printf("\tNA");
    print""
  }
' file2 file1

This assumes that file1 has a fixed number of columns. In the output, columns are separated by tabs.

For space-separated output, pipe the output into expand -t 6, or however large you want your tabstops to be. With -t 6 it looks like this:

rs1   AA    10    QQ    TT    UU
rs2   BB    20    NA    NA    NA
rs3   CC    30    RR    WW    NA
rs4   DD    40    ZZ    NA    NA
ooga
  • 15,423
  • 2
  • 20
  • 21
  • Thanks, it works with the above example files and almost for my actual files.One question, I assume `i=3` refers to the first column you want to import? Other than the `"\t"` in `for (i = 3; i <= NF; ++i) a[$1] = a[$1] "\t" $i`, do I need to change other separators if my files are space-limited? – user2162153 Jul 09 '14 at 16:41
  • @user2162153 The `a[$1] = $2` line saves field 2. Then the loop appends fields 3 to NF, separated by tabs. See edit for converting tabs to spaces. What doesn't work for your actual files? – ooga Jul 09 '14 at 17:01
  • Thanks. It does not exist in the above examples, but if there is no value for the index column in `file2`, an empty field is imported, and as a result there is one blank field and then you have `NA`. e.g. if there is `rs4` in `file2` that does not have any values in the other fields, you would have `rs4 DD 40 NA NA` – user2162153 Jul 09 '14 at 17:12
  • @user2162153 A small edit (adding `&& NF > 1`) might fix that. Try the edited code above. – ooga Jul 09 '14 at 17:15
  • Thanks, sorry, one more question. If `file1` has empty fields in some columns, those get displaced in the output, e.g. if it was `rs1 10` instead of `rs1 AA 10`, `10` does not stay in the 3rd column in the output and gets shifted to the second column. – user2162153 Jul 09 '14 at 17:24
  • @user2162153 Assuming that `file1` always has 3 columns and that only the 2nd column can be missing, the above edit should work. If that's not the case, you'll need to describe *exactly* what your data *actually* looks like! – ooga Jul 09 '14 at 17:39
  • The above edit works for the format you described and my actual files. These questions came to my mind as we were discussing different possibilities for input files. Thanks. – user2162153 Jul 09 '14 at 18:43