
I have a TSV file with data on some event participants. Here is a small snippet from it:

...
sub-09          37   F    19780726   20160328    20160329
sub-10          38   F    19780208   20160406    20160407
sub-11          39   M    19770511   20160704    20160705
...
sub-42          37   F    19780726   20160328    20160329
...

Note that sub-09 and sub-42 are duplicates.

In bash, how can I find duplicate lines while ignoring the first (or, in general, any other) column? I've seen similar threads, e.g., this one, but I couldn't find an answer that fits. Ideally I would get both occurrences of all duplicates, as in:

Expected output:

sub-09          37   F    19780726   20160328    20160329
sub-42          37   F    19780726   20160328    20160329
Daniel

3 Answers


Use uniq -d to show duplicates. Use its -f option to skip fields. As uniq needs the input sorted, first sort ignoring the first column:

sort -nk2 file | uniq -f1 -d

Use -D instead of -d if you want all the duplicates.
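A minimal sketch of the `-D` variant, assuming GNU coreutils (`uniq -D` is a GNU extension) and a hypothetical file name `participants.tsv` with single-space separators for brevity:

```shell
# Build a small sample file (hypothetical name; substitute your own).
cat > participants.tsv <<'EOF'
sub-09 37 F 19780726 20160328 20160329
sub-10 38 F 19780208 20160406 20160407
sub-11 39 M 19770511 20160704 20160705
sub-42 37 F 19780726 20160328 20160329
EOF

# Sort on everything from field 2 onward so duplicates become adjacent,
# then have uniq skip the first field (-f1) and print every duplicate
# occurrence (-D).
sort -k2 participants.tsv | uniq -f1 -D
# prints both the sub-09 and sub-42 lines
```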

choroba

Here is an awk-based solution that avoids sorting the file (which can be pretty expensive for a large file):

awk '{
   p = $1                   # remember column 1
   $1 = ""                  # blank it out; $0 now holds the remaining columns
   freq[$0]++               # count occurrences of the remainder
   col1[$0,freq[$0]] = p    # remember which column-1 value carried it
}
END {
   for (i in freq)
      for (j = 1; freq[i] > 1 && j <= freq[i]; j++)
         print col1[i,j] i
}' file

sub-09 37 F 19780726 20160328 20160329
sub-42 37 F 19780726 20160328 20160329
anubhava
Read the file twice: the first pass counts each line with column 1 blanked out; the second pass prints the original line whenever its remainder occurred more than once:

awk 'FNR==NR{$1="";a[$0]++;next}{s=$0;$1="";if(a[$0]>=2) print s}' file file
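The same two-pass idea written out with comments, assuming a hypothetical file name `participants.tsv` (pass your file twice):

```shell
# Build a small sample file (hypothetical name; substitute your own).
cat > participants.tsv <<'EOF'
sub-09 37 F 19780726 20160328 20160329
sub-10 38 F 19780208 20160406 20160407
sub-42 37 F 19780726 20160328 20160329
EOF

awk '
  # First pass (FNR==NR): blank out column 1 and count each remainder.
  FNR == NR { $1 = ""; count[$0]++; next }
  # Second pass: print the original line if its remainder occurred twice or more.
  { line = $0; $1 = ""; if (count[$0] >= 2) print line }
' participants.tsv participants.tsv
# prints the sub-09 and sub-42 lines
```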
zxy