0

I have 2 tab separated file like these small examples:

example1:

RBM3    1517    993 -0.611355
RBM4    142 142 0
PRKAG1  146 73  -1
MORF4L2 1766    715 -1.30447

example2:

PCNP    370 139 -1.41244
RBM3    60  60  0
COTL1   338 252 -0.4236
PRKAG1  276 225 -0.294743

I want to get the common rows based on column 1 (in both files) and make a new file with 7 columns in which the 1st column is the 1st column in the original files and columns 2, 3 and 4 are from the 1st file and columns 5, 6 and 7 are from 2nd file (columns 2, 3 and 4). here is the expected output:

expected output:

RBM3    1517    993 -0.611355   60  60  0
PRKAG1  146 73  -1  276 225 -0.294743

I am trying to do that in AWK using the following code:

awk -v OFS="\t" 'NR==FNR {n[$2]=$1;next} ($2 in n) {print $1, $2, $3, $4, n[$2], n[$3], n[$4]}' file1 file2  > results.txt

but the results is not correct. do you have any idea how to fix it?

elly
  • 317
  • 1
  • 2
  • 11

3 Answers3

1

Here is one way:

$ awk -v OFS="\t" '
NR==FNR {              # file2
    k=$1               # set key
    $1=""              # nullify $1, OFS stays
    a[k]=$0            # hash record on k
    next
}
($1 in a) {            # file1, if $1 matches in a
    print $0 a[$1]     # output record and a
}' file2 file1         # mind the order
RBM3    1517    993     -0.611355       60      60      0
PRKAG1  146     73      -1      276     225     -0.294743

Using your approach would be something like (untested):

$ awk -v OFS="\t" '
NR==FNR {
    n[$1]=$2; o[$1]=$3; p[$1]=$4
    next
} 
($1 in n) {
    print $1, $2, $3, $4, n[$1], o[$1], p[$1]
}' file2 file1  > results.txt
James Brown
  • 36,089
  • 7
  • 43
  • 59
1

Once again, join is better than awk for this:

$ join -j1 <(sort -k1 file1.txt) <(sort -k1 file2.txt) | sed 's/ /\t/g'      
PRKAG1  146 73  -1  276 225 -0.294743
RBM3    1517    993 -0.611355   60  60  0

See my answer in your other post for why the files are being sorted; if you're using my solution from that to generate these input files, they'll already be sorted and you can just use the files directly. This time it uses sed to make sure the output has tab-separated columns.


As pointed out in a comment, some shells (bash, zsh, others) let you use $'\t' to get a tab character. On those, you can use join -t $'\t' -j1 sortedfile1.txt sortedfile2.txt and leave out the sed bit. Others like dash don't, though, so the first version is more portable.

Shawn
  • 47,241
  • 3
  • 26
  • 60
  • `<(sort file)` won't work in dash either, but apparently `'-t '` (tab in quotes with the `-t`) works in dash and bash. – James Brown Feb 06 '19 at 13:25
  • 1
    @JamesBrown Huh. TIL. Could have sworn that sort of redirection was POSIX syntax. – Shawn Feb 06 '19 at 13:28
  • 1
    Even ksh supports it, and the POSIX shell was based on that. Maybe it was added in ksh93? Oh well. – Shawn Feb 06 '19 at 13:39
0

sort + awk

sort elly1.txt elly2.txt | awk ' {c=$1; if(c==p) {$1=""; print c,a,$0 } p=c;$1="";a=$0 } ' | sed 's/ +/\t/g'

with the given inputs

$ cat elly1.txt
RBM3    1517    993 -0.611355
RBM4    142 142 0
PRKAG1  146 73  -1
MORF4L2 1766    715 -1.30447

$ cat elly2.txt
PCNP    370 139 -1.41244
RBM3    60  60  0
COTL1   338 252 -0.4236
PRKAG1  276 225 -0.294743

$ sort elly1.txt elly2.txt | awk ' {c=$1; if(c==p) {$1=""; print c,a,$0 } p=c;$1="";a=$0 } ' | sed 's/ +/\t/g'
PRKAG1  146 73 -1  276 225 -0.294743
RBM3  1517 993 -0.611355  60 60 0

$
stack0114106
  • 8,534
  • 3
  • 13
  • 38