Using AWK to merge two files based on multiple conditions

Question

I know this question has been asked several times before. Here is one example:

Using AWK to merge two files based on multiple columns

My goal is to print out columns 2, 4, 5 and 7 of file_b and columns 17 and 18 of file_a if the following match occurs: Columns 2, 6 and 7 of file_a.csv matches with Columns 2, 4 and 5 of file_b.csv respectively.

But no matter how much I try, I can't get it to work for my case. Here are my two files:

file_a.csv

col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222

file_b.csv

col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9

Output should look like this:

col2, col4, col5, col7, col17, col18
a, b, c, 4.5, 145, 88
e, f, g, 6.3, 101, 96

I tried this:

awk -F, -v RS='\r\n' 'NR==FNR{key[$2 FS $6 FS $7]=$17 FS $18;next} {if($2 FS $4 FS $5 in key); print $2 FS $4 FS $5 FS $7 FS key[$2 FS $6 FS $7]}' file_a.csv file_b.csv > out.csv

Currently the output I am getting is:

col2, col4, col5, col7,
a, b, c, 4.5,
e, f, g, 6.3,

In other words, col17 and col18 from file_a is not showing up.

Yesterday I asked a related question where I was having issues with line breaks. That got answered and solved but now I think this problem is related to checking the if condition.

Update: I am sharing links to truncated copies of the actual data. The only difference between these files and the actual ones are that the real ones have millions of rows. These ones only have 10 each.

file_a.csv

file_b.csv

Do you need `\r` characters in your Input_file(s)? If not then get rid of them first by using `tr -d '\r' < Input_file > temp && mv temp Input_file`. Also print `awk '{print $17,$18}' Input_file` to see that you are getting output as desiered ones, let us knoe then? — RavinderSingh13, Jan 21 '19 at 00:12
Thanks @RavinderSingh13. See my comment to Tiw's answer below. — Mishal Ahmed, Jan 21 '19 at 02:08

Til · Accepted Answer · 2019-01-21T02:04:00.777

Please try this (GNU sed):

awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'

This is the time BEGIN block kicks in. Also OFS kicks in.
When we are printing out many fields which separated by same thing, we can set OFS, and simply put comma between the things we want to print.

There's no need to check key in arr when you've assigned value for a key in the array,
by default, when arr[somekey] isn't assigned before, it's empty/"", and it evaluates to false in awk (0 in scalar context), and a non-empty string is evaluates to true (There's no literally true and false in awk).
(You used wrong array name, the $2,$6,$7 is the key in the array arr here. It's confusing to use key as array name.)

You can test some simple concept like this:

awk 'BEGIN{print arr["newkey"]}'

You don't need a input file to execute BEGIN block.

Also, you can use quotes sometimes, to avoid confusion and underlying problem.

Update: Your files actually ends in \n, if you can't be sure what the line ending is, use this:

awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

or this (This one will ignore empty lines):

awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

Also, it's better to convert first to avoid such situations, by:

sed -i 's/\r//' files

Or you can use dos2unix command:

dos2unix file

It's a handy commandline tool do above thing only.
You can install it if you don't have it in your system yet.
Once converted, you don't need to assign RS in normal situations.

Not working. Now I am getting an empty file as the output. I did add the filenames at the end of your code: file_a.csv file_b.csv > out.csv — Mishal Ahmed, Jan 21 '19 at 01:37
Working perfectly now. But I am confused about \r and \n. Following RavinderSingh13's comment above, I tried removing "CR" and "LF" (as shown in Notepad++). i managed to remove CR but not LF. Am I correct that CR=/n and LF=/r? Is /n = newline? What is /r then? — Mishal Ahmed, Jan 21 '19 at 02:07
`CR`=`\r`, `LF`=`\n`. Different line endings. Conventional: Windows: `\r\n`, linux: `\n`, macOS: `\r` (I'm not sure about this one). It's different chars do same thing today. (In old days, carriage means to move to line beginning, return means to move to next line. ) @ahmed_m — Til, Jan 21 '19 at 02:09
Thanks. Final question: what does the [ ] and the + in RS="[\r\n]+" mean? — Mishal Ahmed, Jan 21 '19 at 02:16
@ahmed_m `[]` means any character in it (or any other when begins with `^` inside). `+` means `one or more occurances`. Search `Regex` for more. — Til, Jan 21 '19 at 02:18

score 0 · Answer 2 · answered Jan 21 '19 at 01:54

$ awk 'BEGIN      {RS="\r\n"; FS=OFS=","}
       NR==FNR    {a[$2,$6,$7]=$17 OFS $18; next} 
  ($2,$4,$5) in a {print $2,$4,$5,$7,a[$2,$4,$5]}' file1 file2 > output

Your main issue is, in the array lookup the index you should use is the second file key, not the first file key. Also the semicolon after the if condition is wrong. The rest is cosmetics only.

Not sure you want the output \r\n terminated, if so set ORS=RS as well, otherwise it's newline only.

score 0 · Answer 3 · answered Jan 21 '19 at 09:26

Since you have mentioned that the file is huge, you can give a try to Perl, if that is an option.

The files are assumed to have "\r".

$ cat file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
$ cat file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
$ perl -F, -lane 'BEGIN { %kv=map{chomp;chop;@a=split(",");"$a[0],$a[1],$a[2]"=>"$a[3]"} qx(cat file_b.csv) } if($.>1){ $x="$F[0],$F[1],$F[2]";chomp($F[-1]);print "$x,$kv{$x}",join(",",@F[-2,-1]) if $kv{$x} } ' file_a.csv
a, b, c, 4.5 145, 88
e, f, g, 6.3 101, 96
$

Using AWK to merge two files based on multiple conditions

3 Answers3