Using awk to update some columns of a file using the columns of another file based on one common column

Question

I want to update to file1.txt

ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100003305   100003305   G   A   SOMATIC:Whole genome    1
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100066996   100066996   G   A   SOMATIC:Whole genome    2
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100186066   100186066   C   T   SOMATIC:Whole genome    3
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100291988   100291988   C   G   SOMATIC:Whole genome    4
ALL PDXX1   doi:10.1038/nature  GRCh38  SNP 10  100411167   100411167   G   A   SOMATIC:Whole genome    5
ALL PDXX1   doi:10.1038/nature  GRCh38  SNP 10  100437011   100437011   C   T   SOMATIC:Whole genome    6
ALL PDXX1   doi:10.1038/nature  GRCh38  SNP 10  101056011   101056011   G   C   SOMATIC:Whole genome    7
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101072576   101072576   G   C   SOMATIC:Whole genome; dbSNP v147    8
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101127004   101127004   G   A   SOMATIC:Whole genome    9
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101695339   101695339   G   A   SOMATIC:Whole genome    10

using file2.txt

10      200000000       200000000   5
10      300000000       300000000   6
10      400000000       400000000   7

My aim is to get this output file based on the common last column of two files and update 4th, 6th, 7th, 8th columns of the file1.txt using the values from file2.txt.

ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100003305   100003305   G   A   SOMATIC:Whole genome    1
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100066996   100066996   G   A   SOMATIC:Whole genome    2
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100186066   100186066   C   T   SOMATIC:Whole genome    3
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  100291988   100291988   C   G   SOMATIC:Whole genome    4
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  200000000   200000000   G   A   SOMATIC:Whole genome    5
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  300000000   300000000   C   T   SOMATIC:Whole genome    6
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  400000000   400000000   G   C   SOMATIC:Whole genome    7
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101072576   101072576   G   C   SOMATIC:Whole genome; dbSNP v147    8
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101127004   101127004   G   A   SOMATIC:Whole genome    9
ALL PDXX1   doi:10.1038/nature  GRCh37  SNP 10  101695339   101695339   G   A   SOMATIC:Whole genome    10

I tried

 awk -F"\t" 'NR==FNR{a[$4];chr[$1];start[$2];end[$3];next}{if($12 in a) $4=a[$12]; $6=chr[$12]; $7=start[$12]; $8=end[$12]; print FILENAME " " NR " " FNR " " $0}' file2.txt file1.txt

But could not succeed yet.

consider making a smaller set of sample data (Do you really need 12 colums or more to understand how to solve your problem?) and also include the output you are currently getting. Just saying "Zbut could not succeed yet" doesn't really help much. What you have looks reasonable, so at least show us your current output and it will be easier to diagnose your problem. Good luck. — shellter, Mar 12 '20 at 00:09
In fact, this is very small version of the real dataset. Anyway thanks :) — burcak, Mar 12 '20 at 00:36

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I think I solved it.

awk -F"\t" 'NR==FNR{a[$4]=$4;chr[$4]=$1;start[$4]=$2;end[$4]=$3;next}{if($12 in a) {OFS="\t"; $4="GRCh37"; $7=start[$12]; $8=end[$12]; print $0;} else {print $0}}' file2.txt file1.txt

This one line command outputs updated file1.txt with the columns from file2.txt.

I would like to explain what I'm doing here for people like me (who are new to awk). NR==FNR is only true for the first file by its nature (What are NR and FNR and what does "NR==FNR" imply?).

So we get the 4th column of file2.txt in a array. I also save the 1st, 2nd and 3rd columns of file2.txt in chr, start and end arrays.

next let us move to the second file which is file1.txt in our case.

For each value in 12th column of file1.txt we check whether that value is in a, if yes we update the 4th, 7th and 8th columns of file1.txt with the saved values in chr,start and end arrays, respectively.

We saved values in a, chr, start and end arrays using $4 (using the row's values satisfied in the 4th column) as index. And later on we get the corresponding values using $12 (12th column of file1) as index. Just think this as we are getting saved values using indexing in arrays.

Hope it would be helpful for others.

also look at the `join` command. It is designed for such projects. It does required that files are sorted by the key columns. to be used. Glad you found a solution. (I didn't downvote). — shellter, Mar 12 '20 at 00:46

Using awk to update some columns of a file using the columns of another file based on one common column

1 Answers1