
I need help identifying how to count the frequency of duplicate information in a file. For example:

0
0
14
14
10
10
10

I would like a UNIX command that tells me how many distinct numbers are repeated exactly 2 times in a file, and another that tells me how many distinct numbers are repeated more than 2 times.

For example, this command would use the above data and yield an output that tells me there were 2 unique numbers repeated 2 times in the file (0 and 14 each two times in the data set) and 1 unique number that was repeated more than 2 times in the file (10 occurred more than two times in the data set).

bjb568
bjb125
  • Are the numbers one per line or several, and are there only numbers in the file? – Wintermute Jan 13 '15 at 16:20
  • Please check the answer in this link http://stackoverflow.com/questions/6712437/find-duplicate-lines-in-a-file-and-count-how-many-time-each-line-was-duplicated Thanks, Anand – Anand Gangadhara Jan 13 '15 at 16:34
  • Wintermute, the numbers are one per line. – bjb125 Jan 13 '15 at 16:38
  • Anand - This is not what I am looking for. Thank you though. I need an output similar to what this command would do: awk '{a[$0]++}END{for(x in a)b[a[x]]++;for(x in b)print b[x], x}' filename – bjb125 Jan 13 '15 at 16:39
  • The above command would yield the following output: 2 2 and 1 3, meaning that there were 2 instances where a number was repeated twice and 1 instance where a number was repeated 3 times. I want the output to show only how many numbers were repeated exactly 2 times. I then want another command that shows only how many numbers were repeated more than 2 times. – bjb125 Jan 13 '15 at 16:44
  • edit your question to show the ACTUAL output you would expect given that specific input file. Don't just try to describe it in comments. – Ed Morton Jan 13 '15 at 19:22

2 Answers


If you just want to know there were 2 numbers that appeared twice and 1 number that appeared thrice:

sort file | uniq -c | awk '{print $1}' | sort | uniq -c
  2 2
  1 3
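
To report only the two figures asked for in the question, the `uniq -c` counts could be filtered before they are tallied. This is a minimal sketch, assuming one number per line; the temporary sample file and the variable names `exactly_two` and `more_than_two` are made up for illustration and are not part of the original answer:

```shell
# Hypothetical sample file holding the data from the question, one number per line.
sample=$(mktemp)
printf '%s\n' 0 0 14 14 10 10 10 > "$sample"

# uniq -c prefixes each distinct line with its count, so $1 is the count.
# Count the distinct values that occur exactly twice.
exactly_two=$(sort "$sample" | uniq -c | awk '$1 == 2 { n++ } END { print n + 0 }')

# Count the distinct values that occur more than twice.
more_than_two=$(sort "$sample" | uniq -c | awk '$1 > 2 { n++ } END { print n + 0 }')

echo "repeated exactly 2 times: $exactly_two"
echo "repeated more than 2 times: $more_than_two"

rm -f "$sample"
```

With the sample data this reports 2 for the first figure (0 and 14) and 1 for the second (10). The `n + 0` forces a numeric 0 when nothing matches, instead of an empty line.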

If you want to know what the numbers are, I'd use perl:

perl -lne '
        $n{$_}++
    } END {
        push @{$aggregate{$n{$_}}}, $_ for keys %n; 
        $,="\t"; 
        print $_, scalar(@{$aggregate{$_}}), join(",",@{$aggregate{$_}}) for keys %aggregate
' file

outputs

3   1   10
2   2   0,14
glenn jackman
$ cat tst.awk
{ cnt[$0]++ }
END {
    for (key in cnt)
        hits[cnt[key]]++

    for (c in hits)
        print hits[c], c
}
$
$ awk -f tst.awk file
2 2
1 3
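
If only the "exactly 2" and "more than 2" totals are wanted, the same counting idea could be reduced to a single pass that prints just those two numbers. A rough sketch, assuming one value per line; the temporary sample file and the names `twice`, `more`, and `summary` are illustrative assumptions:

```shell
# Hypothetical sample file holding the data from the question, one number per line.
sample=$(mktemp)
printf '%s\n' 0 0 14 14 10 10 10 > "$sample"

summary=$(awk '
    { cnt[$0]++ }                          # occurrences of each distinct line
    END {
        for (key in cnt) {
            if (cnt[key] == 2)     twice++ # value seen exactly two times
            else if (cnt[key] > 2) more++  # value seen more than two times
        }
        # "+ 0" prints 0 instead of an empty field when a bucket is unused
        print twice + 0, more + 0
    }
' "$sample")

echo "$summary"

rm -f "$sample"
```

For the sample data this prints `2 1`: two values repeated exactly twice and one value repeated more than twice. Values that occur only once fall into neither bucket.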

And if you want to know which values are associated with which counts:

$ cat tst.awk
{ cnt[$0]++ }
END {
    for (key in cnt) {
        c = cnt[key]
        hits[c]++
        vals[c] = (c in vals ? vals[c] "," : "") key
    }

    for (c in hits)
        print hits[c], c, vals[c]
}
$
$ awk -f tst.awk file
2 2 0,14
1 3 10
Ed Morton