
I want to merge two files into a single file, line by line, using the first three columns as a key. Example:

file1.txt

a b c 1 4 7
x y z 2 5 8
p q r 3 6 9

file2.txt

p q r 11
a b c 12
x y z 13

My desired output for the above two files is:

a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11

The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.

The files are not ordered. The only guarantee is that the first three fields are the same in both files.

  • I've voted to close this question because it appears to be a request for a recommendation for a tool or solution, rather than a request for assistance with your own code. This makes your question off-topic for StackOverflow. If that assessment was incorrect, and you do indeed want help writing your own code, then please [add your work so far to your question](https://stackoverflow.com/posts/15366829/edit) and I'll retract my close vote. – ghoti Aug 08 '17 at 20:25

4 Answers


You could also use join. It requires sorted input, and since join matches on a single field, the first three fields have to be merged into one. The example below sorts each file and uses sed to merge the key fields before joining and to split them apart again afterwards:

join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
     <(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
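For reference, with the example data the first process substitution (`sort file1.txt | sed 's/ /-/; s/ /-/'`) glues the three key fields into a single dash-joined field, which is what join actually matches on:

a-b-c 1 4 7
p-q-r 3 6 9
x-y-z 2 5 8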

Output:

a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
– Thor

Join on the first three fields, where the total number of fields is variable (four or more):

{
    # collect the fourth field up to the last one
    for (i = 4; i <= NF; i++)
        f = f $i " "

    # append the collected fields to whatever is already stored for this key
    arr[$1 OFS $2 OFS $3] = arr[$1 OFS $2 OFS $3] f
    # reset the field string for the next line
    f = ""
}
END {
    for (key in arr)
        print key, arr[key]
}

Run like:

$ awk -f script.awk file1 file2
a b c 1 4 7 12 
p q r 3 6 9 11 
x y z 2 5 8 13 
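Note that `for (key in arr)` iterates in an unspecified order in awk, so the output above only happens to come out sorted. If the result should follow file1's line order (as the asker requests in a comment further down), here is a minimal sketch of a variant, added for illustration rather than part of the script above, that remembers the order in which keys are first seen (assuming file1 is passed first):

{
    for (i = 4; i <= NF; i++)
        f = f $i " "
    key = $1 OFS $2 OFS $3
    if (!(key in arr))          # first time this key is seen: remember its position
        order[++n] = key
    arr[key] = arr[key] f
    f = ""
}
END {
    for (j = 1; j <= n; j++)
        print order[j], arr[order[j]]
}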
– Chris Seymour

try this:

 awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1 
– Kent
  • This assumes exactly 4 columns. It also assumes that concatenating the first three columns yields a unique key (true for the example input, but maybe not true for the real files). – OpenSauce Mar 12 '13 at 16:40
  • @OpenSauce The OP knows the format of the real file; if the OP has that requirement, I could change the code. – Kent Mar 12 '13 at 16:50
  • indeed, that part is easy to fix, but I wanted to point it out in case the OP wasn't aware. But the bigger problem is "The number of columns in each file is not fixed, it can vary from line to line". You only save `$4` in your array, but I think you need to loop from `$4` to `NF`. – OpenSauce Mar 12 '13 at 16:59
  • $4 columns in which file? in both? How do I do that if I have N columns? thanks a lot. – user2161903 Mar 12 '13 at 17:49
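Picking up the last comment's question (what to do when the files have a variable number of columns): a sketch, not Kent's code, that loops from $4 to NF and keys on the three fields with SUBSEP, keeping file1's line order. It is similar in spirit to the SUBSEP answer below, but carries along all of file2's extra columns:

awk 'NR==FNR{s=""; for(i=4;i<=NF;i++) s=s OFS $i; a[$1,$2,$3]=s; next}
     ($1,$2,$3) in a{print $0 a[$1,$2,$3]}' file2 file1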

If the column values vary in length (so that plain concatenation of $1$2$3 could produce ambiguous keys), you could try something like this using SUBSEP:

awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
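With the example files this should already print the desired output in file1's line order, because file1 is read second and drives the printing:

a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11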

For varying columns in file1 and sorted output, try:

awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
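With the example data this should give (alphabetical because of the trailing sort):

a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13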
– Scrutinizer
  • file2 has 7 fixed columns but file1 can have 3 or more. It's also okay if it works for file1 with 3 columns. I want them to be ordered in the way they are in file1. Thanks a lot. – user2161903 Mar 12 '13 at 17:44