I am now facing a file trimming problem. I would like to trim rows in a tab-delimited file.
The rule is: for rows which with the same value in two columns, preserve only the row with the largest value in the third column. There may be different numbers of such redundant rows defined by two columns. If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).
(1) My file looks like (tab-delimited, with several millions of rows):
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
(2) The output I want:
1 100 25 T
1 101 30 A
1 102 40 T
This problem is faced by my real study, not home-work. I expect to have your helps on that, because I have restricted programming skills. I prefer an computation-efficient way, because there is so many rows in my data file. Your help will be very valuable to me.