
I have a CSV file where I need to dedupe entries whose FIRST field matches, even if the other fields don't match. In addition, the line that is kept should be the one with the highest date in one of the other fields.

This is what my data looks like:

"47917244","000","OTC","20180718","7","2018","20180719","47917244","20180719"
"47917244","000","OTC","20180718","7","2018","20180731","47917244","20180731"
"47917244","000","OTC","20180718","7","2018","20180830","47917244","20180830"

All 3 lines have the same value in the first field. The 9th field is a date field, and I want to dedupe in such a way that the third line, which has the highest date value, is kept, and the other two lines are deleted.

    Did you check the ```-f``` flag of ```uniq```? – accdias Apr 26 '19 at 20:14
  • You are going to get better results if you post what you've tried so far. – accdias Apr 26 '19 at 20:22
  • 1
    @accdias That's the opposite of what he wants. `-f 1` ignores the first field, but the first field is the one he wants to make unique. – Barmar Apr 26 '19 at 20:24
  • @Barmar, I said to check ```-f``` and not ```-f 1```. – accdias Apr 26 '19 at 20:27
  • From `man uniq` concerning the `-f` flag: *"avoid comparing the first N fields"* perhaps I'm missing something but how would this be helpful when it's the "first N fields" that op NEEDS to compare. `Sort` is clearly going to yield a better result. – JNevill Apr 26 '19 at 20:29
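To illustrate the point made in the comments: `uniq -f N` skips the first N *blank-separated* fields before comparing, so it dedupes on the rest of the line rather than on a chosen leading field, and it does not understand comma separators at all. A minimal sketch:

```shell
# uniq -f N ignores the first N blank-separated fields when comparing,
# i.e. it dedupes on everything *after* them -- the opposite of keying
# on the first field, which is what this question needs.
printf '%s\n' 'a x' 'b x' | uniq -f 1   # prints only "a x": first field ignored
```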

1 Answer


After checking another Stack Overflow post (Is there a way to 'uniq' by column?), I got it working with a mix of sort and awk:

# Sort by field 1, then by field 9 (the date) ascending, dropping exact key
# duplicates; awk then keeps the last line seen for each field-1 value,
# which is the one with the highest date.
sort -t, -u -k1,1 -k9,9 <file> |
    awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }'
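For comparison, the same result can be had in a single awk pass without sorting, since the zero-padded YYYYMMDD dates compare correctly as strings (a sketch; `file.csv` is a placeholder name for the input above):

```shell
# For each value of field 1, remember the line whose 9th field is largest.
# Quoted YYYYMMDD dates compare correctly as plain strings.
awk -F',' '!($1 in max) || $9 > max[$1] { max[$1] = $9; row[$1] = $0 }
           END { for (k in row) print row[k] }' file.csv
```

Note this changes the output order to awk's arbitrary array-iteration order, same as the sort-based pipeline above.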