0

I have two CSV files, with ; (semicolon) as a separator, that I need to merge based on three columns on each file using AWK. The key columns are not consecutive. Idea is to get two columns from file B and print them after all the other columns from file A.

File A (keys are in A1, A3, and A5):

A1;A2;A3;A4;A5
K1;D1;K2;D2;K3
K4;D3;K5;D4;K6
K7;D5;K8;D6;K9
K1;D7;K2;D8;K3

File B (keys in B1, B2, B4):

B1;B2;B3;B4;B5
K1;K2;D9;K3;D0
K4;K5;DA;K6;DB
KA;KB;DC;KC;DD

Would produce:

A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

I have found several examples here in SO (for example How to merge two files based on the first three columns using awk and How to merge two files using AWK?) and elsewhere but I haven't been able to convert them to my needs, as they haven't been documented so well that an AWK n00b like myself would really understand how they work.

Closest I've gotten is:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, c[$1]}' B A

But it still leaves out one semicolon--or a column--from output lines 1 and 4:

A1;A2;A3;A4;A5;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;
K1;D7;K2;D8;K3;D9;D0

An how do I state which columns I want to use for comparing? Apparently now it's only using first column for comparing.

Community
  • 1
  • 1
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • could you be more specific in terms of Input, desired output, how the (and which) columns should be compared? Which columns in file A must match which ones from file B, and which columns should be printed in the output then? – FelixJN Aug 14 '15 at 12:49
  • Sorry if it was unclear but files A and B should merge comparing columns (A1, A3, A5) to (B1, B2, B4). What comes to output, all columns should output from A and the "non-key" columns (B3 and B5) from B. You are right, that output example was a bit unclear on that. I tried to use bold in it but it doesn't work in code block. – James Brown Aug 14 '15 at 13:36
  • problem if the files are sorted? – FelixJN Aug 14 '15 at 13:41
  • thought about it, i.e. creating a dummy comparison field from A1,A3,A5 to B1,B2,B4 without delimiter and do the join based on these. – FelixJN Aug 14 '15 at 13:46
  • I guess not. Are you going to suggest using the join command? Join only handles one column and all this data is going to end-up in database so I could do it there but I WANT to learn AWK and use it. – James Brown Aug 14 '15 at 13:51
  • I see, well, the workaround is below. – FelixJN Aug 14 '15 at 14:00
  • 1
    Your example code is only using $1 as key, not the other 2 fields. – karakfa Aug 14 '15 at 16:23

3 Answers3

2

This will print without the extra ; on unmatched lines. You have to provide B file first.

 awk 'BEGIN {
          OFS=FS=";"
      } 

      FNR==NR {
          key[$1 FS $2 FS $4]=$3 OFS $5
      } 

      FNR!=NR {
          c=$1 FS $3 FS $5; 
          if(c in key) 
               print $0,key[c]; 
          else 
               print
      }'  fileB fileA

if you need the extra delimiters, change the last print to print $0 OFS OFS

karakfa
  • 66,216
  • 7
  • 41
  • 56
1
join -j1 -a1 -t';' <(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort ) <(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort ) | cut -d';' -f2,3,4,5,6,9,11

#the commands on new lines for readability only
#join command, print all of file A, even if unmatching
join -j1 -a1 -t';'
#input from file A
<(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort )
#input from file B
<(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort )
#selecting the columns
| cut -d';' -f2,3,4,5,6,9,11

in each case:

1) create a dummy field from the desired columns of file A or B

2) then use paste to create each pseudo file as dummy comparison field; rest of file

3) sort the output for usability with join

4) use join on basis of the dummy field

5) cut the desired columns from the matches join produces

FelixJN
  • 540
  • 3
  • 16
0

Not sure if I understood the requirement properly, but this gives the expected output for the given input:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, $1 in c ? c[$1] : ";"}' B A
A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

From the code in the question, I changed the print statement from

print $0, c[$1]

to

print $0, $1 in c ? c[$1] : ";"
user000001
  • 32,226
  • 12
  • 81
  • 108