
I'm looking for a modified version of the top answer to this question:

extracting unique values between 2 sets/files

awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2

How do I accomplish the same thing by deduplicating on field one instead of the entire line?

File format is the following:

blah@domain.com,Elon,Tusk

I want to output only the lines from file 2 which have emails unique to file 1.

The ideal solution would allow for multiple files rather than only 2, with each file deduplicated against all the files before it, so you could do:

awk .... file1 file2 file3 file4 file5 file6

and somehow output 6 new files, each containing only the rows whose first field is unique relative to all the files before it.

However, if that's too complex, just working on 2 files is fine as well

Underflow
  • @kvantour would love it if you wouldn't mind explaining how this works? – Underflow Jul 22 '20 at 07:47
  • Based on your input file, you seem to have a sequence of commas and spaces as delimiter, hence we use that as the field separator `FS`. We now only check whether the first field (`$1`) was seen in `file1`, so we can do: `awk 'BEGIN{FS="[ \t,]+"} FNR==NR{a[$1]; next} !($1 in a)' file1 file2`. There is also no need to do `a[$1]++`; `a[$1]` is enough, as it just creates an entry in the array `a`. There is no need to count how many times you encounter `$1`, since you are not interested in that. (**note** this only works for two files) – kvantour Jul 22 '20 at 07:53
  • @kvantour the spaces were a mistake, sorry, I have fixed that; the file doesn't have spaces in it – Underflow Jul 22 '20 at 08:08
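Putting the comment thread together, the two-file version keyed on the first field can be sketched as follows (the email addresses below are invented for illustration):

```shell
# Invented sample data: a@x.com appears in both files, b@x.com only in file2.
printf 'a@x.com,Elon,Tusk\n' > file1
printf 'a@x.com,Dup,Line\nb@x.com,New,User\n' > file2

# While reading file1 (FNR==NR), record each email in array a;
# while reading file2, print only lines whose email was not seen in file1.
awk -F',' 'FNR==NR{a[$1]; next} !($1 in a)' file1 file2
# prints: b@x.com,New,User
```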

1 Answer


Based on the input you provided and the requests you made, we can make the following awk script:

awk 'BEGIN{FS=","}                          # fields are comma-separated
    (FNR==1){close(f); f=FILENAME ".new"}   # new input file: redirect output to FILENAME.new
    !(a[$1]++) { print > f }                # print a line only on the first occurrence of its email
   ' file1 file2 file3 file4 file5

This will create 5 files named file[12345].new. Each of these files contains only the lines whose first column has not been seen earlier, in that file or in any file before it. Note that file1.new will be identical to file1, except when file1 itself contains duplicate emails.
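A quick way to sanity-check the script on two files (the sample addresses are invented for illustration):

```shell
# Invented sample data: a@x.com occurs in both files, c@x.com only in file2.
printf 'a@x.com,Elon,Tusk\nb@x.com,Bill,Gatos\n' > file1
printf 'a@x.com,Other,Name\nc@x.com,New,User\n' > file2

awk 'BEGIN{FS=","}
    (FNR==1){close(f); f=FILENAME ".new"}
    !(a[$1]++) { print > f }' file1 file2

cat file2.new
# prints only: c@x.com,New,User  (a@x.com is dropped, it already appeared in file1)
```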

kvantour