
I have a CSV file where I need to dedupe entries whose FIRST field matches, even if the other fields don't match. In addition, the line that is kept should be the one with the highest date in one of the other fields.

This is what my data looks like:

"47917244","000","OTC","20180718","7","2018","20180719","47917244","20180719"
"47917244","000","OTC","20180718","7","2018","20180731","47917244","20180731"
"47917244","000","OTC","20180718","7","2018","20180830","47917244","20180830"

All 3 lines have the same value in the first field. The 9th field is a date field, and I want to dedupe in such a way that the third line, which has the highest date value, is kept, and the other two lines are deleted.

    Did you check the ```-f``` flag of ```uniq```? – accdias Apr 26 '19 at 20:14
  • You are going to get better results if you post what you've tried so far. – accdias Apr 26 '19 at 20:22
  • 1
    @accdias That's the opposite of what he wants. `-f 1` ignores the first field, but the first field is the one he wants to make unique. – Barmar Apr 26 '19 at 20:24
  • @Barmar, I said to check ```-f``` and not ```-f 1```. – accdias Apr 26 '19 at 20:27
  • From `man uniq` concerning the `-f` flag: *"avoid comparing the first N fields"* perhaps I'm missing something but how would this be helpful when it's the "first N fields" that op NEEDS to compare. `Sort` is clearly going to yield a better result. – JNevill Apr 26 '19 at 20:29
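To illustrate the point made in the comments: `uniq -f N` skips the first N *blank-separated* fields before comparing, so it dedupes on the rest of the line rather than on a chosen leading field, and it does not understand comma separators at all. A minimal sketch:

```shell
# uniq -f N ignores the first N blank-separated fields when comparing,
# i.e. it dedupes on everything *after* them -- the opposite of keying
# on the first field, which is what this question needs.
printf '%s\n' 'a x' 'b x' | uniq -f 1   # prints only "a x": first field ignored
```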

1 Answer


After checking another Stack Overflow post (Is there a way to 'uniq' by column?), I got it working with a mix of sort and awk:

# Sort by field 1, then by field 9 (the date) ascending, dropping exact key
# duplicates; awk then keeps the last line seen for each field-1 value,
# which is the one with the highest date.
sort -t, -u -k1,1 -k9,9 <file> |
    awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }'
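For comparison, the same result can be had in a single awk pass without sorting, since the zero-padded YYYYMMDD dates compare correctly as strings (a sketch; `file.csv` is a placeholder name for the input above):

```shell
# For each value of field 1, remember the line whose 9th field is largest.
# Quoted YYYYMMDD dates compare correctly as plain strings.
awk -F',' '!($1 in max) || $9 > max[$1] { max[$1] = $9; row[$1] = $0 }
           END { for (k in row) print row[k] }' file.csv
```

Note this changes the output order to awk's arbitrary array-iteration order, same as the sort-based pipeline above.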