Bash/Shell: How to remove duplicates from csv file by columns?

Question

I have a CSV file that uses ; as the separator. I need to remove every line whose combination of 2nd and 3rd column values is not unique, and print the result to standard output.

Example input:

irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant  
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant

Desired output

irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant

I have found solutions that keep the first line of each group of duplicates:

sort -u -t ";" -k2,3 file

but this is not enough.

I have tried to use uniq -u but I can't find a way to check only a few columns.

In all the lines there isn't a unique value in the 2nd and 3rd columns. — Avinash Raj, Aug 22 '14 at 15:28
I agree with @jaypal, that question is about finding unique records only. — anubhava, Aug 22 '14 at 15:32
@AvinashRaj: OP wants to list those records where `col2, col3` appear only once in whole file. — anubhava, Aug 22 '14 at 15:38
Yes, @anubhava is right. Storing the data in some temporary structure seems to be the only way. The awk and perl solutions seem very similar. — xpdude, Aug 23 '14 at 00:19

anubhava · Accepted Answer · 2014-08-22T15:45:48.767

Using awk:

awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
      END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant

Explanation: The first time a $2,$3 combination is seen, the whole record is stored in the data array under that key. The counter in the seen array is incremented every time the same $2,$3 combination occurs. In the END block, only the records whose counter equals 1 are printed.
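
awk does not specify the iteration order of `for (i in seen)`, so the lines may come out in a different order than in the input. If input order matters and the input is a regular file rather than a pipe, a two-pass variant is one option. This is a minimal sketch; the function name `unique_by_cols` is illustrative only.

```shell
# Print lines whose (col2, col3) pair occurs exactly once, in input order.
# Assumes the argument is a regular file, because awk reads it twice.
unique_by_cols() {
    awk -F';' '
        NR == FNR { count[$2,$3]++; next }   # first pass: count each key
        count[$2,$3] == 1                    # second pass: print keys counted once
    ' "$1" "$1"
}
```

Call it as `unique_by_cols file`. The trade-off is reading the file twice in exchange for not holding any full lines in memory.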

score -1 · Answer 2 · answered Aug 22 '14 at 16:04

If order is important and if you can use perl then:

perl -F";" -lane '
    $key = join ";", @F[1,2];   # composite key from columns 2 and 3
    $uniq{$key}++ or push @rec, [$key, $_]
}{
    print $_->[1] for grep { $uniq{$_->[0]} == 1 } @rec' file
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant

We use columns 2 and 3 to create a composite key. For the first occurrence of each key, we push the key and the line onto the array rec, building an array of arrays.

After all input has been read (the }{ idiom acts like an END block), we check whether each stored key occurred only once. If so, we print its line.
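
The same bookkeeping (remember the first line per key, count occurrences, then print the keys counted once in first-seen order) can also be sketched in awk, for cases where perl is not available or the data arrives on a pipe. The function name `unique_by_cols_stream` and the array names are illustrative, not part of the answer above.

```shell
# Single pass over standard input; prints lines whose (col2, col3) pair
# occurs exactly once, in the order the keys were first seen.
unique_by_cols_stream() {
    awk -F';' '
        {
            key = $2 SUBSEP $3
            if (!(key in count)) {      # first occurrence of this key
                order[++n] = key        # remember first-seen position
                line[key] = $0          # remember the full record
            }
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[order[i]] == 1) print line[order[i]]
        }
    '
}
```

It can be used as `some_command | unique_by_cols_stream` or `unique_by_cols_stream < file`. Like the perl version, it keeps one line per distinct key in memory.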

score -1 · Answer 3 · edited Mar 25 '15 at 00:15

awk '!a[$0]++' file_input > file_output

This worked for me. It compares whole lines and keeps the first occurrence of each.
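
For comparison with the question, the same idiom can be restricted to columns 2 and 3. The sketch below (with the illustrative name `first_per_key`) keeps the first line for each key, so a duplicated key still appears once in the output. That is deduplication, not the "only keys that occur once" behaviour the question asks for.

```shell
# Keep the first line for each (col2, col3) pair; later duplicates are dropped.
# Reads the files given as arguments, or standard input if there are none.
first_per_key() {
    awk -F';' '!seen[$2,$3]++' "$@"
}
```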

Andrey Strelnikov · answered Mar 24 '15 at 23:33 · edited by 0x5C91
