2

I have a CSV file delimited with ;. I need to remove every line whose combination of 2nd and 3rd column values is not unique, and print the remaining lines to standard output.

Example input:

irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant  
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant  

Desired output:

irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  

I have found solutions where only the first line of each group of duplicates is printed to the output:

sort -u -t ";" -k2,3 file

but this is not enough, because one copy of each duplicated line still remains.

I have tried uniq -u, but I can't find a way to make it compare only specific columns.

jaypal singh
xpdude

3 Answers

5

Using awk:

awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
      END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant

Explanation: the first time a $2,$3 combination is seen, the whole record is stored in the data array under that key. Every occurrence of the combination increments its counter in the seen array. In the END block, only the records whose counter equals 1 are printed. Note that the iteration order of for (i in seen) is unspecified in awk, so the lines may not come out in input order.
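If input order matters, the same counting idea can be sketched as a two-pass awk command that reads the file twice. This is a minimal sketch, not part of the answer above: the temporary file and the variable names input, count and result are assumptions for illustration, and it needs a regular file (not a pipe) because the file is named twice on the command line.

```shell
# Sample input from the question, written to a temporary file (hypothetical path).
input=$(mktemp)
cat > "$input" <<'EOF'
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
EOF

# Pass 1 (NR==FNR is true only while reading the first file argument):
#   count each (column 2, column 3) pair.
# Pass 2: print only the lines whose pair was counted exactly once,
#   which preserves the original line order.
result=$(awk -F';' 'NR==FNR { count[$2,$3]++; next } count[$2,$3]==1' "$input" "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The trade-off is that the file is read twice, but only the counters are held in memory rather than a copy of every first-seen record.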

anubhava
-1

If order is important and you can use Perl:

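perl -F";" -lane '
    # Join columns 2 and 3 into one key; assigning the slice directly
    # to a scalar would keep only its last element ($F[2]).
    $key = join ";", @F[1,2];
    $uniq{$key}++ or push @rec, [$key, $_]
}{
    print $_->[1] for grep { $uniq{$_->[0]} == 1 } @rec' file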
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  

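We use columns 2 and 3 to build a composite key. On the first occurrence of each key, we push a [key, line] pair onto the array rec, which gives an array of arrays in input order, and we count every occurrence of the key in the hash uniq.

After all input has been read (the }{ closes the implicit -n loop, so the code after it behaves like an END block), we check whether each stored key occurred exactly once. If so, we print its line.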

jaypal singh
-1
awk '!a[$0]++' file_input > file_output

This worked for me. It compares whole lines and keeps the first occurrence of each one.
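To make the behaviour concrete, here is a small sketch comparing this idiom with sort piped into uniq -u, both of which compare whole lines. The three sample lines and the variable names input, deduped and unique_only are made up for illustration. Neither command looks at columns 2 and 3 only, so this shows the difference between "keep one copy" and "drop every repeated line" rather than solving the question.

```shell
# Three hypothetical lines; the first and third are identical.
input=$(mktemp)
printf '%s\n' 'x;data1;data2;x' 'x;data5;data6;x' 'x;data1;data2;x' > "$input"

# Whole-line de-duplication: the first copy of each repeated line is kept.
deduped=$(awk '!a[$0]++' "$input")

# uniq -u prints only lines that are not repeated; it compares adjacent
# lines, so the input has to be sorted first (input order is lost).
unique_only=$(sort "$input" | uniq -u)

printf 'awk idiom:\n%s\n' "$deduped"
printf 'sort | uniq -u:\n%s\n' "$unique_only"
rm -f "$input"
```

With this input the awk idiom still prints one x;data1;data2;x line, while sort piped into uniq -u drops both copies.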

0x5C91