7
    1.gui  Qxx  16
    2.gu   Qxy  23
    3.guT  QWS  18
    4.gui  Qxr  21

I want to sort a file by the value in the 3rd column, so I use:

sort -rnk3 myfile

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
1.gui  Qxx  16

Now I need the output to be the following (the line starting with 1.gui is dropped because the line with 4.gui has a greater value):

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

I cannot use head because I have millions of rows and do not know where to cut. I also could not figure out a way to use uniq: it compares whole lines, and since I cannot tell it to look only at the first column, every line counts as unique and gets printed, which is expected. I know uniq can skip a number of characters, but as the example shows, the first column can vary in length.

Please advise.

teutara

3 Answers

9

Try this:

sort -rnk3 myfile | awk -F"[. ]" '!a[$2]++'

awk removes the duplicates based on the 2nd field. This is a well-known awk idiom for removing duplicates. It keeps an array indexed by the 2nd field, and before each record is printed, the array is checked for that field. If the field has not been seen yet, the record is printed; otherwise it is discarded as a duplicate. The ++ is what makes this work: because it is a postfix increment, a[$2]++ evaluates to 0 the first time a key is encountered, so the negation is true and the record is printed. On subsequent occurrences the count is already non-zero, so the negation is false.
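To make the idiom easier to follow, here is a long-form sketch of what '!a[$2]++' does, run on the question's sample data after sorting. The function names sorted_sample and dedupe_by_key and the variable seen_before are illustrative only and do not appear in the original answer.

```shell
# Sample rows from the question, already sorted by column 3 (descending).
sorted_sample() {
  printf '%s\n' \
    '2.gu   Qxy  23' \
    '4.gui  Qxr  21' \
    '3.guT  QWS  18' \
    '1.gui  Qxx  16'
}

# Long form of the one-liner: print a line only the first time its key is seen.
dedupe_by_key() {
  awk -F'[. ]' '{
    seen_before = a[$2]   # empty (false) on first sight, non-zero afterwards
    a[$2]++               # bump the counter for this key
    if (!seen_before) print
  }'
}

sorted_sample | dedupe_by_key
```

The row 1.gui is dropped because the key gui was already counted when 4.gui was read; gu, gui and guT are distinct keys since awk array indices are case-sensitive.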

Guru
  • It is the 2nd column because we are splitting each line with . and space as delimiters, so the 2nd field gives us gui, etc. – Guru Nov 27 '12 at 12:06
2

Here you go:

sort -rnk3 file | awk -F'[. ]' '{ if (a[$2]++ == 0) print }' 

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

This uses awk to check for duplicate values in the second field, where the field separator is either a space or a period. So this is what it treats as the second field:

$ awk -F'[. ]' '{ print $2 }' file

gu
gui
guT
gui

In awk, the variable $0 represents the whole line, $1 the first field, and so on.

In awk -F'[. ]' '{ if (a[$2]++ == 0) print }', the -F option lets you specify the field separator; in this case it is either a space or a period.
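One possible variation, sketched here under the assumption that the key is always the text after the first period in column 1: keep awk's default whitespace splitting and extract the key with split(). This avoids the empty fields that -F'[. ]' produces for runs of spaces, and it still works if lines are indented or tab-separated. The function name top_per_key and the array names parts and seen are hypothetical.

```shell
# Hypothetical helper: keep the row with the largest column-3 value per key.
top_per_key() {
  # -k3,3nr restricts the sort key to column 3, numeric, largest first.
  sort -k3,3nr "$1" |
    awk '{
      split($1, parts, ".")          # "4.gui" -> parts[1]="4", parts[2]="gui"
      if (!seen[parts[2]]++) print   # first row per key is the largest one
    }'
}

# Demo on the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1.gui  Qxx  16
2.gu   Qxy  23
3.guT  QWS  18
4.gui  Qxr  21
EOF
top_per_key "$sample"
rm -f "$sample"
```

Writing the key as -k3,3 rather than -k3 limits the comparison to the third column only, which matters if the file ever gains more columns.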

Chris Seymour
0

I found this question through Google. My little script builds on @sudo_O's answer: it shows you all the duplicate lines found, rather than a file without duplicates.

The text in which I was looking for duplicates in the 3rd column (port) was in a file called master.txt:

awk '{ if (a[$3]++ > 0) print }' master.txt | while read -r site thread port
do
  # Quote the variable and match whole words so port 80 does not also match 8080
  grep -w -- "$port" master.txt
done
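As an alternative sketch, the same listing can be produced by awk alone by reading the file twice, which avoids running grep once per duplicate and matches only on column 3. The function name show_duplicates and the sample rows below are made up for illustration; only the site/thread/port column layout comes from the script above.

```shell
# Hypothetical helper: print every line whose column-3 value occurs more than once.
show_duplicates() {
  # Pass 1 (NR == FNR): count each value of column 3.
  # Pass 2: print lines whose column-3 count is greater than 1.
  awk 'NR == FNR { count[$3]++; next } count[$3] > 1' "$1" "$1"
}

# Demo with made-up rows in the site/thread/port layout.
demo=$(mktemp)
cat > "$demo" <<'EOF'
siteA t1 8080
siteB t2 9090
siteC t3 8080
siteD t4 7070
EOF
show_duplicates "$demo"
rm -f "$demo"
```

Each matching line is printed once, in file order, even when a port appears three or more times.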
Ziferius