
I need to get the unique lines when comparing 2 files. These files use ":" as a field separator, and everything from the ":" onward should be ignored when comparing lines.

file1 contains these lines:

apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long

file2 contains these lines:

orange:nice
banana:long

The output file should be this (the 2 orange lines and the 2 banana lines deleted):

apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green

So only the strings before ":" should be compared.

Is it possible to complete this task in one command?

I tried to complete the task this way, but the field separator does not work in that situation:

awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile


2 Answers


You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.

Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:

$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
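The `FNR==NR` trick works because `FNR` resets to 1 at the start of each input file while `NR` keeps counting across all files, so the first block fires only while reading the first file named on the command line (file2 here). A commented sketch of the same one-liner, with the sample files recreated so it runs standalone:

```shell
# Recreate the sample files from the question
printf '%s\n' apple:tasty apple:red orange:nice kiwi:awesome \
    kiwi:expensive banana:big grape:green orange:oval banana:long > file1
printf '%s\n' orange:nice banana:long > file2

awk -F: '
    FNR == NR {    # true only while reading the first argument (file2)
        a[$1]++    # remember every key that appears in file2
        next       # and skip the other rule for these lines
    }
    !a[$1]         # file1 lines: print only if the key was never seen
' file2 file1
```

This prints the five expected lines in their original file1 order, since file1 is only read, never sorted.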
jas
  • Thank you, exactly the wanted result! Actually I was referring to this similar question https://stackoverflow.com/questions/4717250/extracting-unique-values-between-2-sets-files My bad, I forgot that $0 refers to the full line. – Katrin Izengard Aug 25 '19 at 01:01

One comment: awk is very inefficient with arrays. In real life, with big files, it is better to use something like:

comm -23 <(cut -d: -f1 file1 | sort -u) <(cut -d: -f1 file2 | sort -u) | grep -f /dev/stdin file1
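A sketch of the same idea on the sample files from the question, written with temporary key files instead of bash-only process substitution so it runs under plain sh. Note that `comm` needs `-23` (not `-3`): `-3` would also emit the keys unique to file2, tab-indented, which would corrupt the grep patterns. The `sed` step anchoring each key as `^key:` is an extra precaution I'm adding so that, say, `grape` cannot accidentally match a `grapefruit:` line:

```shell
# Sample files from the question
printf '%s\n' apple:tasty apple:red orange:nice kiwi:awesome \
    kiwi:expensive banana:big grape:green orange:oval banana:long > file1
printf '%s\n' orange:nice banana:long > file2

# comm needs sorted input; extract and sort the keys of each file
cut -d: -f1 file1 | sort -u > keys1
cut -d: -f1 file2 | sort -u > keys2

# Keys that appear only in file1, turned into anchored grep patterns,
# then used to select the corresponding lines of file1
comm -23 keys1 keys2 |
    sed 's/.*/^&:/' |
    grep -f /dev/stdin file1
```

Unlike the awk solution, this keeps only the sorted key lists in memory rather than a hash of every key, at the cost of a sort over each file.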
Eran Ben-Natan
  • What exact problem can awk cause on big files? What is the file size limit for it to work as expected? – Katrin Izengard Aug 25 '19 at 14:39
  • First, it is very slow, as awk searches the arrays sequentially. Second, it keeps the whole array in memory. It is hard (not to say impossible) to tell the limit, as it depends on your HW. Anyway, if your files are not hundreds of megabytes or more, you are probably OK. – Eran Ben-Natan Aug 26 '19 at 11:03
  • @EranBen-Natan, awk's associative arrays are implemented as hash tables with constant time lookup; the arrays are not searched sequentially. On your second point, I completely agree --- the whole array must fit in memory and this can be a problem for humongous data sets. – jas Aug 26 '19 at 17:05
  • Thanks, @jas , that's news for me! – Eran Ben-Natan Aug 28 '19 at 04:47