Sort and remove duplicates based on column

Question

I have a text file:

$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10

I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.

Approach 1

$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1

It is not sorting based on the first column.

Approach 2

$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1

It removes the 542,9,1,418,1 line but I'd like to keep one copy.

It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?

jaypal singh · Accepted Answer · 2013-07-25T03:12:09.987

The problem is that when you give sort a key together with -u, uniqueness is judged on that key only, not on the whole line. Since the line 542,8,1,418,1 is kept, sort treats the next two lines starting with 542 as duplicates of it and filters them out.
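
A minimal sketch of that behaviour, using the sample data from the question. The scratch directory and the variable names by_key and by_line are only for illustration, and LC_ALL=C is an assumption made here so the comparison does not depend on the locale:

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmpdir/text"

# Key restricted to field 1: -u keeps only one line per distinct key value,
# so the three lines starting with 542 collapse into one.
by_key=$(LC_ALL=C sort -t, -u -k1,1n "$tmpdir/text")
printf '%s\n' "$by_key"

# No key at all: -u compares whole lines, so only the exact duplicate goes.
# (The order here is plain byte order, not numeric order.)
by_line=$(LC_ALL=C sort -u "$tmpdir/text")
printf '%s\n' "$by_line"

rm -rf "$tmpdir"
```

The first command should leave three lines and the second four, which is the difference between "unique by key" and "unique by line".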

Your best bet would be to either sort all columns:

sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text

or use awk to filter out duplicate lines and pipe the result to sort:

awk '!_[$0]++' text | sort -t, -nk1,1
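
To spell out what the awk part does, here is the same pipeline as a small self-contained sketch. The array name seen replaces _ purely for readability, and the scratch directory, the deduped_sorted variable and LC_ALL=C are assumptions added for the example:

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmpdir/text"

# seen[$0]++ evaluates to 0 the first time a line appears and to a positive
# count afterwards, so !seen[$0]++ is true only for the first occurrence.
# awk therefore prints each distinct line once, in input order; sort then
# orders the survivors numerically on the first comma-separated field.
deduped_sorted=$(awk '!seen[$0]++' "$tmpdir/text" | LC_ALL=C sort -t, -k1,1n)
printf '%s\n' "$deduped_sorted"

rm -rf "$tmpdir"
```

Because the duplicates are removed on the whole line before sorting, sort no longer needs -u, and both distinct 542 lines survive.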

edited Jul 25 '13 at 03:12

answered Jul 25 '13 at 02:19

jaypal singh

`uniq` requires the input file to be sorted. Is it possible that the output of `sort` based on column 1 is not sorted based on all columns? – Yang Jul 25 '13 at 02:25
My guess is that if I can sort based on (1,2,3,4,5) using `-nk1,5`, then `uniq` should work, but for some cryptic reason it doesn't. – Yang Jul 25 '13 at 02:36
@Yang Hmm, you can also do `awk '!_[$0]++' text | sort -t, -nk1,1` to first filter duplicate lines and then pipe that to sort. – jaypal singh Jul 25 '13 at 02:46
Thanks, this does the trick. One remaining question: why does `-nk1,5` not work? It is supposed to sort on column 1 first, then 2, etc., but the output is like approach 1. – Yang Jul 25 '13 at 02:56
@Yang That is not the right way to sort. You'll have to do `sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text` to sort for all the columns and then list `unique` lines from it. – jaypal singh Jul 25 '13 at 03:03

score 0 · Answer 2 · answered Jul 25 '13 at 02:13

When sorting on a key, you must specify where the key ends as well; otherwise the key runs from that field to the end of the line.

The following should work:

sort -t, -u -k1,1n text
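
Note that with -u the key-only comparison still applies, so only one line per first-column value remains. As a hedged sketch of one way to keep the bounded key but drop only exact duplicates, -u can be left out and the result piped through uniq. This relies on sort breaking ties between equal keys by comparing the whole line (its behaviour when neither -u nor -s is given), which puts identical lines next to each other. The scratch directory, the sorted_uniq variable and LC_ALL=C are assumptions for the example:

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '542,8,1,418,1' \
  '542,9,1,418,1' \
  '301,34,1,689070,1' \
  '542,9,1,418,1' \
  '199,7,1,419,10' > "$tmpdir/text"

# -k1,1n limits the numeric key to the first field. Lines whose keys tie are
# ordered by a whole-line comparison, so exact duplicates become adjacent
# and uniq can drop them.
sorted_uniq=$(LC_ALL=C sort -t, -k1,1n "$tmpdir/text" | uniq)
printf '%s\n' "$sorted_uniq"

rm -rf "$tmpdir"
```

On the sample data this should print four lines: 199, 301, and the two distinct 542 lines.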

answered Jul 25 '13 at 02:13

choroba
