Join more than two files with awk (or any other unix command) on unsorted column

Question

I have 4 files (say A, B, C, D), each with one column (MAC address):

**file A**  
ej  
j8  
00  
5h  
fl  

**file B**  
ej  
6o  
00  
jq  
j6  

**file C**  
ej  
85  
54  
5e  
f9  

**file D**  
ej  
j8  
70  
5e  
70

where file A is my primary file.

A MAC address from A should not be present in any of the other files B, C and D:

If it is, remove it.
Or can we create a new column with a 'Y'/'N' flag showing whether it is present?

*Please note that this column cannot be sorted.

Expected Output:

5h
fl

It would be great if you could include a way to specify the column number for each file, in case there is more than one column.

Why is expected output only `c`? There are no `a`s in `C` and `E` as well — oguz ismail, Aug 29 '19 at 13:45
I want entities from A which are not present in any of the other files. — ram_23, Aug 29 '19 at 13:47
Still not clear. I think you should provide better examples. Like, `B: a,h,j`, is this supposed to mean `B` has one comma separated row, or it has 3 rows and first column `a`, `h` and `j` are the values in the first column? — oguz ismail, Aug 29 '19 at 13:57
@oguzismail sorry, my bad, I have edited the question, for this I don't care about other columns, please see now. — ram_23, Aug 29 '19 at 14:03
If you don't want the MAC address from A in any of the other files, could you "clean" the other files using "grep -v MAC_address_from_A file_B >> tmp.out && mv tmp.out file_B" — cowboydan, Aug 29 '19 at 14:09
@cowboydan I have 2 doubts with that, 1. Is using ```grep``` better than ```awk```? 2. I want the remaining entities from file A, it kinda seems like I'll get the cleaned file B from your command, please explain if I understood wrong. — ram_23, Aug 29 '19 at 14:16
@ram_23 - Please disregard my comment - I think I misunderstood your desire (after I re-read your example). — cowboydan, Aug 29 '19 at 17:04
Even though I already gave an answer, it would be nice if you could [edit] your question and add some example input and corresponding example output. The way it is written down now is really unclear. — kvantour, Aug 29 '19 at 21:46
@kvantour Thanks for suggesting editing rather than downvoting it, I did some corrections, please see if it's clear now. — ram_23, Aug 30 '19 at 07:37
Your question is still not clear, please edit and provide **(A)** example input which actually contains MAC-addresses **(B)** in a comment on a deleted post you mentioned that you want to be able to specify the mac address column, make your input reflect that **(C)** what did you attempt yourself? — kvantour, Aug 30 '19 at 08:40

kvantour · Accepted Answer · 2019-08-29T14:58:50.043

My suggestion would be something like this:

awk '(NR==FNR){a[$1]=$0;next}
     ($1 in a){delete a[$1]}
     END{for(i in a) print a[i]}' file_a file_b file_c ...

This assumes that the key in all files is `$1` (i.e. the mac-address). The code works in the following way:

- `(NR==FNR){a[$1]=$0;next}`: when reading the first file (file A), store its records/lines in an array indexed by the mac address located in field 1. Use `next` to skip any further processing and move to the next record/line.
- `($1 in a){delete a[$1]}`: for any other file, check if the key (mac address) is part of the array `a`. If it is, we can remove it from the array, as we are not interested in it.
- `END{for(i in a) print a[i]}`: at the end, when all files are processed, check which mac addresses are still in the array. These are the keys which are in file A but not in any of the other files. Print them. (Be aware that they will not necessarily be printed in the same order as in file A.)
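As a quick check, the command can be run against the sample data from the question. This is only an illustrative sketch: the scratch directory and the names `file_a` to `file_d` are assumptions, and the `sort` is there solely because the order of `for (i in a)` is unspecified.

```shell
# Recreate the question's four sample files in a scratch directory
# (directory and file names are assumptions made for this example).
dir=$(mktemp -d)
printf '%s\n' ej j8 00 5h fl > "$dir/file_a"
printf '%s\n' ej 6o 00 jq j6 > "$dir/file_b"
printf '%s\n' ej 85 54 5e f9 > "$dir/file_c"
printf '%s\n' ej j8 70 5e 70 > "$dir/file_d"

# Same awk program as in the answer; sort only makes the output order stable.
result=$(awk '(NR==FNR){a[$1]=$0;next}
     ($1 in a){delete a[$1]}
     END{for(i in a) print a[i]}' "$dir/file_a" "$dir/file_b" "$dir/file_c" "$dir/file_d" | sort)
printf '%s\n' "$result"
```

With that sample data only `5h` and `fl` should remain, which matches the expected output in the question.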

If $1 is not always the key, but each line has one mac-address somewhere, we can pick it up with a regex:

awk 'BEGIN{ere_mac = "[0-9A-Fa-f][0-9A-Fa-f][-:]"
           ere_mac = ere_mac ere_mac ere_mac ere_mac ere_mac;
           ere_mac = ere_mac "[0-9A-Fa-f][0-9A-Fa-f]"}
     { match($0,ere_mac); key=substr($0,RSTART,RLENGTH)}
     (NR==FNR) { a[key]=$0; next }
     (key in a) { delete a[key] }
     END { for(i in a) print a[i] }' file_a file_b file_c ...

Note: this is a rather roundabout way to build `ere_mac`, but it works if your awk does not accept grouping and repetitions. Otherwise use `ere_mac = "([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})"`.
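To see what the `match()`/`substr()` step pulls out of a line, here is a small sketch. The two input lines are made up for illustration, and the pattern is built without grouping or repetition so that it should not depend on the awk flavour.

```shell
# Two invented lines with the MAC address in different positions.
keys=$(printf '%s\n' 'host1 00:1A:2b:3c:4D:5e up' '7 aa-bb-cc-dd-ee-ff' |
  awk 'BEGIN { h = "[0-9A-Fa-f][0-9A-Fa-f]"; s = "[-:]"
               ere_mac = h s h s h s h s h s h }
       # match() sets RSTART and RLENGTH, so substr() returns just the matched address.
       match($0, ere_mac) { print substr($0, RSTART, RLENGTH) }')
printf '%s\n' "$keys"
```

Each line yields only its MAC address, regardless of which field it sits in, and that string is what serves as the array key.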

A completely different and simpler alternative (it relies on process substitution, so it needs a shell such as bash, ksh or zsh) would be:

grep -vFf <(awk '{print $1}' file_b file_c ...) file_a
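The question also asks for a 'Y'/'N' flag and for a way to pick the key column. One possible sketch, not taken from the answer above: read the other files first and the primary file last, and use awk's command-line variable assignments between the file names to set the column per file. The variable names `k` and `primary`, the scratch directory and the file names are assumptions for this example.

```shell
# Sample files from the question, recreated in a scratch directory (names are assumptions).
dir=$(mktemp -d)
printf '%s\n' ej j8 00 5h fl > "$dir/file_a"
printf '%s\n' ej 6o 00 jq j6 > "$dir/file_b"
printf '%s\n' ej 85 54 5e f9 > "$dir/file_c"
printf '%s\n' ej j8 70 5e 70 > "$dir/file_d"

# k       = key column for the files that follow it on the command line
# primary = set to 1 just before the file whose lines should be printed
flagged=$(awk '!primary { seen[$k]; next }               # other files: remember their keys
               { print $0, (($k in seen) ? "Y" : "N") }  # primary file: flag each line
              ' k=1 "$dir/file_b" "$dir/file_c" "$dir/file_d" primary=1 k=1 "$dir/file_a")
printf '%s\n' "$flagged"
```

Because file A is read last, its lines come out in their original order. Changing the second rule to `!($k in seen)` would print only the lines that do not occur elsewhere, and putting a different `k=` in front of a file selects a different column for that file.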
