
I have 2 tab-separated files with 2 columns each: column 1 is a number and column 2 is an ID, like these 2 examples:

example file1:

188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1

example file2:

91  PSMD7
2217    TPT1
223 ATP5C1
156 TCP1

I want to find the common rows of the 2 files based on column 2 (the ID column) and make a new tab-separated file with 4 columns: column 1 is the common ID, column 2 is the number from file1, column 3 is the number from file2, and column 4 is the log2 of the ratio of columns 2 and 3, i.e. log2(column2/column3). For example, for the ID "TPT1": column 1 is TPT1, column 2 is 188, column 3 is 2217 and column 4 is log2(188/2217), which is about -3.5598. Here is the expected output:

expected output:

TPT1	188	2217	-3.5598
ATP5C1	420	223	0.913346

I am trying to do that in AWK using the following code:

awk 'NR==FNR { n[$2]=$0;next } ($2 in n) { print n[$2 '\t' $1] '\t' $1 '\t' log(n[$1]/$1)}' file1.txt file2.txt  > result.txt

This code does not return what I expect. Do you know how to fix it?

elly
  • Possible duplicate of [How to escape single quotes within single quoted strings?](https://stackoverflow.com/questions/1250079/how-to-escape-single-quotes-within-single-quoted-strings) – tripleee Feb 06 '19 at 11:09
  • thanks. no this is a different one – elly Feb 06 '19 at 11:15
  • 1
    No, that's exactly the reason your attempt didn't work, though there are additional problems as well (you don't use `$1 "\t" $2` as a key in `n` so it will be empty). You can't have single quotes in your Awk script if it's in single quotes. Switch to double quotes or a workaround like in the answer you got. – tripleee Feb 06 '19 at 11:19
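Putting the comments' advice together, a minimal corrected one-liner might look like this (the sample files are recreated inline from the question; the literal single-quoted tabs are replaced with `OFS`, and the array is keyed on `$2` only):

```shell
# Recreate the question's sample files (hypothetical names from the question).
printf '188\tTPT1\n133\tACTR2\n420\tATP5C1\n942\tDNAJA1\n' > file1.txt
printf '91\tPSMD7\n2217\tTPT1\n223\tATP5C1\n156\tTCP1\n' > file2.txt

# First pass stores file1's number under its ID; on the second pass,
# for IDs seen in both files, print ID, file1 number, file2 number,
# and log2 of their ratio (awk's log() is natural log, hence /log(2)).
awk -v OFS='\t' 'NR==FNR { n[$2]=$1; next }
                 $2 in n { print $2, n[$2], $1, log(n[$2]/$1)/log(2) }' \
    file1.txt file2.txt > result.txt
cat result.txt
```

With awk's default output format (`%.6g`) this prints `-3.5598` and `0.913346` in the last column.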

3 Answers

$ awk -v OFS="\t" 'NR==FNR {n[$2]=$1;next} ($2 in n) {print $2, n[$2], $1, log(n[$2]/$1)/log(2)}' file1 file2
TPT1	188	2217	-3.5598
ATP5C1	420	223	0.913346
jas

I'd use join to actually merge the files instead of awk:

$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
   awk -v OFS="\t" '{ print $1, $2, $3, log($2/$3)/log(2) }'
ATP5C1  420 223 0.913346
TPT1    188 2217    -3.5598

The join program, well, joins two files on a common value. It does require the files to be sorted based on the join column, but your examples aren't, hence the inline sorting of the data files. Its output is then piped to awk to compute the log2 of the numbers of each line and produce tab-delimited results.
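For reference, here is what `join` alone produces on the question's sample data before the awk stage (sorted copies written to temporary files; `join` prints the join field first, space-separated):

```shell
# Recreate the question's sample files (hypothetical names).
printf '188\tTPT1\n133\tACTR2\n420\tATP5C1\n942\tDNAJA1\n' > file1.txt
printf '91\tPSMD7\n2217\tTPT1\n223\tATP5C1\n156\tTCP1\n' > file2.txt

# join requires both inputs sorted on the join field (here field 2).
sort -k2 file1.txt > file1.sorted
sort -k2 file2.txt > file2.sorted

# Output per line: ID, file1 number, file2 number.
join -j2 file1.sorted file2.sorted > joined.txt
cat joined.txt
```

This yields `ATP5C1 420 223` and `TPT1 188 2217`, which the awk stage then reformats and extends with the log2 column.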


Alternative using perl, which gives you more default precision if you care about that (and don't want to mess with awk's OFMT variable):

$ join -j2 <(sort -k2 a.txt) <(sort -k2 b.txt) |
   perl -lane 'print join("\t", @F, log($F[1]/$F[2])/log(2))'
ATP5C1  420 223 0.913345617745818
TPT1    188 2217    -3.55980420318967
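If you'd rather stay in awk, an explicit `printf` format gives similar precision without changing any global settings (a variant sketch, not the answer's exact command; process substitution is replaced with temporary sorted files):

```shell
# Recreate the question's sample files (hypothetical names).
printf '188\tTPT1\n133\tACTR2\n420\tATP5C1\n942\tDNAJA1\n' > file1.txt
printf '91\tPSMD7\n2217\tTPT1\n223\tATP5C1\n156\tTCP1\n' > file2.txt
sort -k2 file1.txt > file1.sorted
sort -k2 file2.txt > file2.sorted

# %.15g prints up to 15 significant digits, close to full double
# precision, matching what perl prints by default.
join -j2 file1.sorted file2.sorted |
  awk '{ printf "%s\t%s\t%s\t%.15g\n", $1, $2, $3, log($2/$3)/log(2) }' > result.txt
cat result.txt
```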
Shawn

awk + sort approach

awk '{ print $0, FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 |
  awk '{ c=$2; if (c == p) { print c, a, $1, log(a/$1)/log(2) } p=c; a=$1 }'

with the given inputs

$ cat ellyx.txt
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1

$ cat ellyy.txt
91  PSMD7
2217 TPT1
223 ATP5C1
156 TCP1

$ awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598

$
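If the output must be tab-separated like the question's expected result, the same pipeline works with `OFS` set on the final awk (a minor variation on the answer above; sample files recreated inline):

```shell
# Recreate the question's sample files under the answer's names.
printf '188\tTPT1\n133\tACTR2\n420\tATP5C1\n942\tDNAJA1\n' > ellyx.txt
printf '91\tPSMD7\n2217\tTPT1\n223\tATP5C1\n156\tTCP1\n' > ellyy.txt

# Tag each line with its source file, sort so the ellyx.txt line comes
# first within each ID, then pair consecutive lines sharing an ID.
awk '{ print $0, FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 |
  awk -v OFS='\t' '{ c=$2
    if (c == p) print c, a, $1, log(a/$1)/log(2)
    p = c; a = $1 }' > result.txt
cat result.txt
```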
stack0114106