0

I have 2 files, like these: file1.txt

chr pos ref totoal  A   C   T   G
chr1    2488104 A   8690    99.954  0.0230  0   0.0230
chr1    2488105 T   8847    0   99.954  0.022   0.0226
chr1    2488106 G   8902    0.011   0.0337  0   99.955
chr1    2488107 G   8875    0   0.02252 0   99.977
chr1    2488108 A   8674    99.919  0   0.011   0.0691
chr1    2488109 G   9116    0.021   0.0658  0   99.912
chr1    2488110 C   9191    0.087   0.0652  99.847  0
chr1    2488111 C   9291    0   0.0430  99.956  0
chr1    2488112 T   9254    0   100 0   0
chr1    2488113 C   9354    0   0.0427  99.957  0
chr1    2488114 C   9493    0   0.0842  99.915  0

and a file2.txt:

chr1    2488111 T   0.09
chr1    2488105 C   0.053
chr1    2488115 G   0.03

I would like to extract the numeric value of columns A-D of file1 if $1 and $2 matches between the two files. The choice of what column extract should depend on the $3 field of file2 (i.e. for "chr1 2488111 T" i would have back "99.956" since it is the value reported for chr1 position 2488111 in the column "T".

I found bunch of codes to match the fields between the two files, but I cannot have back just my value of interest. The AWK is the following:

awk '
BEGIN           {FS = OFS = "\t"
                }

NR==FNR         {for (n = split ($2, a); n>0; n--) CMP[$1, a[n]]
                 next
                }($1,$2) in CMP {print $0      
                }
'  file2.txt file1.txt

I would like to have back a single value extracted from file1 based on the match between fields and the match between the $3 of file2 and the header of file1.

The ideal output would be to report the original line of file2 with the value appended at the end of the line:

chr1    2488111 T   0.09   99.956
chr1    2488105 C   0.053   0
chr1    2488115 G   0.03   NA

1 Answers1

1

It doesn't match your posted expected output but it does what your question describes so I think it probably is what you're really looking for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    if (NR == 1) {
        for (i=1; i<=NF; i++) {
            fldNr2name[i] = $i
        }
        next
    }
    for (fldNr=3; fldNr<=NF; fldNr++) {
        fldName = fldNr2name[fldNr]
        vals[$1,$2,fldName] = $fldNr
    }
    next
}
{
    print $0, ( ($1,$2,$3) in vals ? vals[$1,$2,$3] : "NA" )
}

$ awk -f tst.awk file1 file2
chr1    2488111 T       0.09    99.956
chr1    2488105 C       0.053   99.954
chr1    2488115 G       0.03    NA
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 2
    That's is what I'm looking for! Thank you very much Mr. Morton – Filippo Vit Jul 21 '19 at 14:39
  • No, whatever the problem is it has nothing to do with how many rows or columns your files contain. You'll have to employ divide and conquer or similar to your input files to pinpoint the issue but first check if you have DOS line endings as that's often the culprit for non-obvious errors, see https://stackoverflow.com/q/45772525/1745001. No, I don't want to get a copy of your files to debug this, sorry. – Ed Morton Jul 22 '19 at 16:07
  • 1
    There was an error in the file format. Thank you again for your hepl! – Filippo Vit Jul 23 '19 at 06:32