Comparing multiple columns of different files and appending a column from a file if there is a match

Question

I am having a problem while accessing the columns of a file in awk. I have two files, one has 12 columns and the other has 5 columns.

1.txt
chr1 10 20 . . + chr1 30 40 ABC . +
chr2 11 22 . . + chr2 90 92 XXX . -
chrX 33 42 . . + chrX 70 80 XXX . +
chr4 3  12 . . + chr4 70 80 ZZZ . +

And,

2.txt
1 chr1 30 40 ABC
3 chr1 35 40 ABC
27 chr2 90 92 XXX
1 chrX 70 80 XXX
2 chrY 12 13 XXX

I want to compare the 2nd, 3rd, 4th and 5th columns of 2.txt with the 7th, 8th, 9th and 10th columns of 1.txt. If they match, it should print the whole line of 1.txt followed by the 1st column of 2.txt.

Expected output:

chr1 10 20 . . + chr1 30 40 ABC . + 1
chr2 11 22 . . + chr2 90 92 XXX . - 27
chrX 33 42 . . + chrX 70 80 XXX . + 1

Since I could not compare all four columns, I tried with two. I can compare two columns from each file (the 2nd and 3rd of 2.txt against the 7th and 8th of 1.txt) and print a string when there is a match, but I cannot print the first column of 2.txt. My code:

awk 'NR==FNR {a[$2 FS $3];next} {print $0 FS (($7 FS $8) in a?"exists":"none")}' 2.txt 1.txt

What it produces (which is not what I want):

chr1 10 20 . . + chr1 30 40 ABC . + exists
chr2 11 22 . . + chr2 90 92 XXX . - exists
chrX 33 42 . . + chrX 70 80 XXX . + exists
chr4 3  12 . . + chr4 70 80 ZZZ . + none

How can I change this new 13th column to the corresponding 1st column of 2.txt?

RomanPerekhrest · Accepted Answer · 2017-11-21T08:05:07.723

2

awk approach:

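awk 'NR==FNR{ a[$2,$3,$4,$5]=$1; next }    # 2.txt: key = columns 2-5, value = column 1
     { s=SUBSEP; k=$7 s $8 s $9 s $10 }    # 1.txt: build the same key from columns 7-10
     k in a{ print $0,a[k] }' 2.txt 1.txt  # on a match, print the line plus the stored column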

The output:

chr1 10 20 . . + chr1 30 40 ABC . + 1
chr2 11 22 . . + chr2 90 92 XXX . - 27
chrX 33 42 . . + chrX 70 80 XXX . + 1
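
The answer relies on awk joining a comma-separated subscript such as a[$2,$3,$4,$5] into a single string, with the parts separated by SUBSEP. The following is a minimal sketch of that behaviour, not part of the original answer; the literal values and the variable names k, key, part and n are only illustrative.

```shell
result=$(awk 'BEGIN {
  # Store one entry under a four-part subscript, as the answer does per line of 2.txt.
  a["chr1", 30, 40, "ABC"] = 1

  # Build the same key by hand: the parts joined with SUBSEP.
  k = "chr1" SUBSEP 30 SUBSEP 40 SUBSEP "ABC"
  print ((k in a) ? "hand-built key found" : "hand-built key missing")

  # The parenthesised list form tests the same key without building it.
  print ((("chr1", 30, 40, "ABC") in a) ? "list form found" : "list form missing")

  # Splitting a stored key on SUBSEP recovers the individual parts.
  for (key in a) {
    n = split(key, part, SUBSEP)
    print n, part[1], part[2], part[3], part[4]
  }
}')
printf '%s\n' "$result"
```

Both lookups should report the key as found, which suggests that the hand-built k in the answer and the shorter test ($7,$8,$9,$10) in a are interchangeable.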

edited Nov 21 '17 at 08:05

answered Nov 20 '17 at 14:16


It works, thank you! Could you explain how you appended the 1st column as the 13th? – bapors Nov 20 '17 at 14:22

@bapors, welcome, `print $0,a[k]` prints the whole line `$0` from file `1.txt` (12 fields) followed by the captured 1st field from `2.txt`, given by `a[k]` – RomanPerekhrest Nov 20 '17 at 14:24
How can we keep every line of 1.txt here, even when there is no match? – bapors Jan 15 '18 at 14:04
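
The last comment has no reply in this thread. One possible approach, sketched here under assumptions, is to print every line of 1.txt and append either the stored first column of 2.txt or a placeholder. The placeholder NA and the scratch directory are choices made for this illustration, not something taken from the answer.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/1.txt" <<'EOF'
chr1 10 20 . . + chr1 30 40 ABC . +
chr2 11 22 . . + chr2 90 92 XXX . -
chrX 33 42 . . + chrX 70 80 XXX . +
chr4 3  12 . . + chr4 70 80 ZZZ . +
EOF
cat > "$tmp/2.txt" <<'EOF'
1 chr1 30 40 ABC
3 chr1 35 40 ABC
27 chr2 90 92 XXX
1 chrX 70 80 XXX
2 chrY 12 13 XXX
EOF

result=$(awk '
  # First file (2.txt): remember column 1 under the key built from columns 2-5.
  NR == FNR { a[$2,$3,$4,$5] = $1; next }
  # Second file (1.txt): always print the line; NA is an assumed placeholder for no match.
  { print $0, ((($7,$8,$9,$10) in a) ? a[$7,$8,$9,$10] : "NA") }
' "$tmp/2.txt" "$tmp/1.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data, the three matching lines should end in 1, 27 and 1, and the chr4 line should end in NA. Testing membership with in before reading a[...] avoids creating empty entries in the array for keys that do not match.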

RavinderSingh13 · Answer 2 · 2017-11-20T14:27:26.147

The following awk command may help.

awk 'FNR==NR{a[$2,$3,$4,$5]=$0;next} {printf("%s%s\n",$0,(($7,$8,$9,$10) in a)?" exists":" none")}' 2.txt 1.txt

Output will be as follows.

chr1 10 20 . . + chr1 30 40 ABC . + exists
chr2 11 22 . . + chr2 90 92 XXX . - exists
chrX 33 42 . . + chrX 70 80 XXX . + exists
chr4 3  12 . . + chr4 70 80 ZZZ . + none

Here is the same command with an explanation.

awk '
FNR==NR{               ##FNR==NR is true only while the first input file, 2.txt, is being read.
  a[$2,$3,$4,$5]=$0;   ##Store the current line in array a, indexed by the 2nd, 3rd, 4th and 5th fields.
  next                 ##next skips the remaining statements and moves on to the next input line.
}
{
  ##For each line of 1.txt, print the line followed by " exists" if fields 7-10 form an index of array a, otherwise " none".
  printf("%s%s\n",$0,(($7,$8,$9,$10) in a)?" exists":" none")
}' 2.txt 1.txt         ##The input files; 2.txt must come first.
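
To see why FNR==NR singles out the first file, it can help to print both counters for every line. The following is a small sketch with two throwaway files; the names first.txt and second.txt and their contents are made up for the illustration.

```shell
# Two throwaway input files in a scratch directory.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first.txt"
printf 'c\nd\ne\n' > "$tmp/second.txt"

# NR counts lines across all files; FNR restarts at 1 for each file.
result=$(cd "$tmp" && awk '
  { print FILENAME, NR, FNR, (FNR == NR ? "first file" : "later file") }
' first.txt second.txt)
printf '%s\n' "$result"
rm -rf "$tmp"
```

The two counters agree only while the first file is being read, which is why the order of the file names on the command line matters in the command above.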

However, my expected 13th column is the corresponding 1st column of 2.txt, as in: 'chr1 10 20 . . + chr1 30 40 ABC . + 1 chr2 11 22 . . + chr2 90 92 XXX . - 27 chrX 33 42 . . + chrX 70 80 XXX . + 1' – bapors Nov 20 '17 at 14:17
@bapors, I have now added an explanation of my code, with the expected output as you showed it. – RavinderSingh13 Nov 20 '17 at 14:28
