Using AWK to merge two files based on multiple columns

Question

I have two CSV files, with ; (semicolon) as a separator, that I need to merge based on three columns on each file using AWK. The key columns are not consecutive. Idea is to get two columns from file B and print them after all the other columns from file A.

File A (keys are in A1, A3, and A5):

A1;A2;A3;A4;A5
K1;D1;K2;D2;K3
K4;D3;K5;D4;K6
K7;D5;K8;D6;K9
K1;D7;K2;D8;K3

File B (keys in B1, B2, B4):

B1;B2;B3;B4;B5
K1;K2;D9;K3;D0
K4;K5;DA;K6;DB
KA;KB;DC;KC;DD

Would produce:

A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

I have found several examples here in SO (for example How to merge two files based on the first three columns using awk and How to merge two files using AWK?) and elsewhere but I haven't been able to convert them to my needs, as they haven't been documented so well that an AWK n00b like myself would really understand how they work.

Closest I've gotten is:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, c[$1]}' B A

But it still leaves out one semicolon--or a column--from output lines 1 and 4:

A1;A2;A3;A4;A5;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;
K1;D7;K2;D8;K3;D9;D0

An how do I state which columns I want to use for comparing? Apparently now it's only using first column for comparing.

could you be more specific in terms of Input, desired output, how the (and which) columns should be compared? Which columns in file A must match which ones from file B, and which columns should be printed in the output then? — FelixJN, Aug 14 '15 at 12:49
Sorry if it was unclear but files A and B should merge comparing columns (A1, A3, A5) to (B1, B2, B4). What comes to output, all columns should output from A and the "non-key" columns (B3 and B5) from B. You are right, that output example was a bit unclear on that. I tried to use bold in it but it doesn't work in code block. — James Brown, Aug 14 '15 at 13:36
thought about it, i.e. creating a dummy comparison field from A1,A3,A5 to B1,B2,B4 without delimiter and do the join based on these. — FelixJN, Aug 14 '15 at 13:46
I guess not. Are you going to suggest using the join command? Join only handles one column and all this data is going to end-up in database so I could do it there but I WANT to learn AWK and use it. — James Brown, Aug 14 '15 at 13:51
Your example code is only using $1 as key, not the other 2 fields. — karakfa, Aug 14 '15 at 16:23

karakfa · Accepted Answer · 2015-08-14T16:22:17.947

This will print without the extra ; on unmatched lines. You have to provide B file first.

 awk 'BEGIN {
          OFS=FS=";"
      } 

      FNR==NR {
          key[$1 FS $2 FS $4]=$3 OFS $5
      } 

      FNR!=NR {
          c=$1 FS $3 FS $5; 
          if(c in key) 
               print $0,key[c]; 
          else 
               print
      }'  fileB fileA

if you need the extra delimiters, change the last print to print $0 OFS OFS

FelixJN · Answer 2 · 2015-08-14T14:06:53.510

1

join -j1 -a1 -t';' <(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort ) <(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort ) | cut -d';' -f2,3,4,5,6,9,11

#the commands on new lines for readability only
#join command, print all of file A, even if unmatching
join -j1 -a1 -t';'
#input from file A
<(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort )
#input from file B
<(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort )
#selecting the columns
| cut -d';' -f2,3,4,5,6,9,11

in each case:

1) create a dummy field from the desired columns of file A or B

2) then use paste to create each pseudo file as dummy comparison field; rest of file

3) sort the output for usability with join

4) use join on basis of the dummy field

5) cut the desired columns from the matches join produces

edited Aug 14 '15 at 14:06

answered Aug 14 '15 at 13:55

FelixJN

540
3
16

It excluded lines 1 and 4 in the desired output. – James Brown Aug 14 '15 at 14:02
Oh, I skipped that you want the unmatched lines of `A`, too. Updated, see `-a1` in `join` command. – FelixJN Aug 14 '15 at 14:04
note that it does not print `;;` for empty fields or a single `;` as last character - I hope that is of less relevance – FelixJN Aug 14 '15 at 14:08

score 0 · Answer 3 · answered Aug 14 '15 at 15:41

Not sure if I understood the requirement properly, but this gives the expected output for the given input:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, $1 in c ? c[$1] : ";"}' B A
A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

From the code in the question, I changed the print statement from

print $0, c[$1]

to

print $0, $1 in c ? c[$1] : ";"

Using AWK to merge two files based on multiple columns

3 Answers3

Linked