
This is related to some previous questions.

I have a file like this:

FOO,BAR,100,200,300
BAZ,TAZ,500,600,800
FOO,BAR,900,1000,1000
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
BAZ,TAZ,100,40,500

The duplicates are determined by the first two fields. In addition, the more "recent" record (lower in the file / higher line number) is the one that should be retained.

What is an awk script that will output:

BAZ,TAZ,100,40,500
FOO,BAR,100,10000,200
HERE,THERE,1000,200,100

Output order is not so important.

Explanation of awk syntax would be great.

noahlz
2 Answers


This is easy in awk: we just need to feed an associative array, using the 1st and 2nd columns combined as the key and the rest of the fields as the value:

$ awk -F, '{a[$1","$2]=$3","$4","$5}END{for(i in a)print i,a[i]}' OFS=, file.txt
BAZ,TAZ,100,40,500
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
Gilles Quénot
  • Can you please elaborate or link to an explanation of arrays in awk and how they help solve this problem? – noahlz Apr 05 '13 at 22:41
  • 1
    An associative array have unique keys by nature. That's all the magic. Since we keep only the latest values, we just need to iterate over the lines and @the end, display the array line by lines – Gilles Quénot Apr 05 '13 at 22:53
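Since the question asks for an explanation of the syntax, here is the same one-liner expanded with comments. The sample data is inlined via printf so the snippet runs as-is; substitute `file.txt` in its place for real use:

```shell
# Sample input from the question, piped in so the snippet is self-contained
printf '%s\n' \
  'FOO,BAR,100,200,300' \
  'BAZ,TAZ,500,600,800' \
  'FOO,BAR,900,1000,1000' \
  'HERE,THERE,1000,200,100' \
  'FOO,BAR,100,10000,200' \
  'BAZ,TAZ,100,40,500' |
awk -F, '                          # -F, : use comma as the input field separator
{
    key = $1 "," $2                # dedup key: first two fields joined by a comma
    a[key] = $3 "," $4 "," $5      # later lines overwrite earlier ones with the same key
}
END {                              # runs once, after the last input line
    for (i in a)                   # iterate the stored keys (order is unspecified)
        print i "," a[i]           # print key and value, comma-joined
}'
```

Because assignment to `a[key]` simply overwrites any previous entry, the last occurrence of each key in the file wins, which is exactly the "most recent record" rule.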

This might work for you (tac and GNU sort):

tac file | sort -sut, -k1,2
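For reference, a breakdown of that pipeline (the sample data is inlined here via printf so it runs without a separate file; `tac` is from GNU coreutils, and `-s` is a GNU sort extension):

```shell
# tac reverses the file, so the most recent record of each key comes FIRST.
# sort flags:
#   -s      stable: among lines whose keys compare equal, keep input order
#   -u      unique: output only the first line of each group of equal keys
#   -t,     use comma as the field delimiter
#   -k1,2   compare (and deduplicate) on fields 1 through 2 only
printf '%s\n' \
  'FOO,BAR,100,200,300' \
  'BAZ,TAZ,500,600,800' \
  'FOO,BAR,900,1000,1000' \
  'HERE,THERE,1000,200,100' \
  'FOO,BAR,100,10000,200' \
  'BAZ,TAZ,100,40,500' |
tac | sort -s -u -t, -k1,2
```

Together, reversing first and then keeping the first line per key means the last occurrence in the original file is the one retained.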
potong
  • Ok, you demonstrated a solution that doesn't even use awk, and now I know about `tac` and some `sort` options that weren't obvious. Winner. – noahlz Apr 08 '13 at 02:29