-3

I am now facing a file trimming problem. I would like to trim rows in a tab-delimited file.

The rule is: for rows which with the same value in two columns, preserve only the row with the largest value in the third column. There may be different numbers of such redundant rows defined by two columns. If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).

(1) My file looks like (tab-delimited, with several millions of rows):

1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T

(2) The output I want:

1 100 25 T
1 101 30 A
1 102 40 T

This problem is faced by my real study, not home-work. I expect to have your helps on that, because I have restricted programming skills. I prefer an computation-efficient way, because there is so many rows in my data file. Your help will be very valuable to me.

Jonathan Hall
  • 75,165
  • 16
  • 143
  • 189
jianfeng.mao
  • 945
  • 3
  • 13
  • 21
  • Is the input file already sorted? – Jonathan Hall Jun 24 '11 at 10:12
  • On python I suggest http://docs.python.org/dev/py3k/library/csv.html#module-csv to parse tab-delimited files. Do you need a more specific answer? – Evpok Jun 24 '11 at 10:12
  • And I assume you're looking for an answer in any language that works, since you didn't specify otherwise, and you tagged quite a number of languages. – Jonathan Hall Jun 24 '11 at 10:14
  • 15
    jianfeng.mao, [your recent questions](http://stackoverflow.com/users/574090) are essentially all of the type "write the code for me"; this is not good for the Stack Overflow community. Please, before asking, make a real effort to solve a problem as good as you can, then document the progress which you have made. Show the source code of what you already managed to write. Say what went wrong, and which specific problem you encountered. – Earlier appeals: [①](http://stackoverflow.com/q/6307788#comment-7438638) [②](http://stackoverflow.com/q/6355349#comment-7439024) – daxim Jun 24 '11 at 10:19
  • 1
    Why is 1-102-40-T is chosen over 1-102-40-A ? You have to preserve the first one? – Priyank Bhatnagar Jun 24 '11 at 12:45
  • Dear daxim, I am sorry I can not do that by myself. I can only contribute the questions/problems. I have few ability in programming, though computations were included in study. I can only do some very simple things, like straight-forward replace, rearrange. I learned R by myself, but it can not work on big data (GBs). In my opinion, questions/problems like scripts used to solve them, are also helpful for others. I have not stopped my self-learning of programming. – jianfeng.mao Jun 24 '11 at 14:12
  • In the meantime I am learning programming, I do not want to stop putting forward my questions/problems. That will also helpful for others and this (StackEXchange) question-directed community. – jianfeng.mao Jun 24 '11 at 14:21
  • A fine reply, thank you for taking the time to speak your mind. You should learn programming in a structured way, for example by attending a class or self-study with [books](http://stackoverflow.com/q/194812); you'll gain a more useful, broad knowledge this way compared to perusing code examples and discussion on Web forums. – daxim Jun 24 '11 at 18:25

4 Answers4

2

Here's a solution that will rely on the input file already being sorted appropriately. It will scan line-by-line for lines with similar start (e.g. two first columns identical), check the third column value and preserve the line with the highest value - or the line that came first in the file. When a new start is found, it prints the old line, and begins checking again.

At the end of the input file, the max line in memory is printed out.

use warnings;
use strict;

my ($max_line, $start, $max) = parse_line(scalar <DATA>);
while (<DATA>) {
    my ($line, $nl_start, $nl_max) = parse_line($_);
    if ($nl_start eq $start) {
        if ($nl_max > $max) {
            $max_line = $line;
            $max = $nl_max;
        }
    } else {
        print $max_line;
        $start = $nl_start;
        $max = $nl_max;
        $max_line = $line;
    }
}

print $max_line;

sub parse_line {
    my $line = shift;
    my ($start, $max) = $line =~ /^([^\t]+\t[^\t]+\t)(\d+)/;
    return ($line, $start, $max);
}
__DATA__
1   100 25  T
1   101 26  A
1   101 27  G
1   101 30  A
1   102 40  A
1   102 40  T

The output is:

1       100     25      T
1       101     30      A
1       102     40      A

You stated

If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).

which is rather cryptic. Then you asked for output that seemed to contradict this, where the last value was printed instead of the first.

I am assuming that what you meant is "preserve the first value". If you indeed meant "preserve the last value", then simply change the > sign in if ($nl_max > $max) to >=. This will effectively preserve the last value equal instead of the first.

If you however implied some kind of sort, which "after ordering the file" seems to imply, then I do not have enough information to know what you meant.

TLP
  • 66,756
  • 10
  • 92
  • 149
1

Here's one approach

use strict;
use warnings;
use constant 
    { LINENO => 0
    , LINE   => 1
    , SCORE  => 2
    };
use English qw<$INPUT_LINE_NUMBER>;

my %hash;
while ( <> ) { 
    # split the line to get the fields
    my @fields = split /\t/;
    # Assemble a key for everything except the "score"
    my $key    = join( '-', @fields[0,1] );
    # locally cache the score
    my $score  = $fields[SCORE];

    # if we have a score, and the current is not greater, then next
    next unless ( $hash{ $key } and $score > $hash{ $key }[SCORE];
    # store the line number, line text, and score
    $hash{ $key } = [ $INPUT_LINE_NUMBER, $_, $score ]; 
}

# sort by line number and print out the text of the line stored.
foreach my $struct ( sort { $a->[LINENO] <=> $b->[LINENO] } values %hash ) {
    print $struct->[LINE]; 
}
Axeman
  • 29,660
  • 2
  • 47
  • 102
  • In your solution, `1 101 27 G` and `1 101 30 A` would be treated as two different `$key`s, `1-101-G` and `1-101-A`. Also, this solution may not be viable for "several millions of rows". – TLP Jun 24 '11 at 12:31
  • @TLP, true. I wandered away from the specs, there. – Axeman Jun 24 '11 at 16:42
1

In python you can try the following code:

res = {}
for line in (line.split() for line in open('c:\\inpt.txt','r') if line):
    line = tuple(line)
    if not line[:2] in res:
        res[line[:2]] = line[2:]
        continue
    elif res[line[:2]][0] <= line[3]:
        res[line[:2]] = line[2:]

f = open('c:\\tst.txt','w')
[f.write(line) for line in ('\t'.join(k+v)+'\n' for k,v in res.iteritems())]
f.close()
Artsiom Rudzenka
  • 27,895
  • 4
  • 34
  • 52
1

In Python too, but cleaner imo

import csv
spamReader = csv.reader(open('eggs'), delimiter='\t')
select = {}
for row in spamReader:
    first_two, three = (row[0], row[1]), row[2]
    if first_two in select:
         if select[first_two][2] > three:
             continue
    select[first_two] = row

spamWriter = csv.writer(open('ham', 'w'), delimiter='\t')
for line in select:
    spamWrite.writerow(select[line])
Evpok
  • 4,273
  • 3
  • 34
  • 46