0

I would like to collapse rows based on equality of first column. Then add the content of second column to the new collapsed table, comma-separated and with additional space. Also, if content of second column are the same, collapse them, that is, if 'non-virulent' appears two times in output file, show it just once.

I'm quite new here, please explain how to run it. Hope anyone can help me!

Input (tab-delimited):

HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

Desired output (tab-delimited):

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
biotech
  • 697
  • 1
  • 7
  • 17

8 Answers8

2

Perl from command line,

perl -lane'
  ($n, $p) =@F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",\t";
    print "$_\t@{$h{$_}}" for @r;
  }
' file

output

HS372_01446     non-virulent,   lung
HS372_00498     non-virulent,   lung
HS372_00954     jointlungCNS,   non-virulent,   moderadamentevirulenta(nose),  lung
mpapec
  • 50,217
  • 8
  • 67
  • 127
2

Another Perl solution:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/uniq/;

my %hash;
while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}

while ( my ( $key, $values ) = each %hash )
{
    print "$key\t", join ', ', uniq @$values, "\n";  
}

__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung
Chris
  • 729
  • 6
  • 12
  • Dreadful variable names: `%hash` is as useful as `$scalar`. `List::MoreUtils` isn't a core module and may need installing. No point in the `chomp` as the `split` will ignore any whitespace. `"\t"` is rarely useful as it aligns output only for the most minimal set of data. But +1 because this is close to the best solution for an unsorted output. – Borodin Feb 12 '14 at 16:33
2

This does what you ask, and in addition keeps the IDs and descriptions in the same order that they appear in the file, in case that matters:

use strict;
use warnings;

open my $fh, '<', 'diseases.txt';

my %diseases;
my @ids;

while (<$fh>) {
  my ($id, $desc) = split;
  if (not $diseases{$id}) {
    $diseases{$id}{list} = [$desc];
    $diseases{$id}{seen}{$desc} = 1;
    push @ids, $id;
  }
  elsif (not $diseases{$id}{seen}{$desc}) {
    push @{ $diseases{$id}{list} }, $desc;
    $diseases{$id}{seen}{$desc} = 1;
  }
}

for my $id (@ids) {
  printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} };
}

output

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
Borodin
  • 126,100
  • 9
  • 70
  • 144
1
from collections import defaultdict

a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".split("\n")

stuff = defaultdict(set)

for line in a:
    uid, symp = line.split(" ")
    stuff[uid].add(symp)

for uid, symps in stuff.iteritems():
    print "%s %s" % (uid, ", ".join(list(symps)))
Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
  • Traceback (most recent call last): File "script.py", line 22, in uid, symp = line.split(" ") ValueError: need more than 1 value to unpack – biotech Feb 12 '14 at 10:31
1

Java:

javac Collapse.java

java Collapse input.txt

import java.io.*;
import java.util.*;

public class Collapse {

    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));

        Map<String, Set<String>> output = new HashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, "\t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }

        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "\t" + sb);
        }
    }
}
gabor
  • 400
  • 3
  • 12
1

The standard UNIX tool for parsing text files is awk:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file
HS372_00498     non-virulent, lung
HS372_00954     jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446     non-virulent, lung
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
0

In perl:

use warnings;
use strict; 

open my $input, '<', 'in.txt';

my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    $hash{$split[0]}{$split[1]} = 1;
}

for my $key (keys %hash){
    print "$key\t";
        for my $info (keys $hash{$key}){
            print "$info\t";
        }
    print "\n";
} 

Which prints:

HS372_01446 non-virulent    lung    
HS372_00954 non-virulent    moderadamentevirulenta(nose)    jointlungCNS    lung    
HS372_00498 non-virulent    lung
fugu
  • 6,417
  • 5
  • 40
  • 75
  • Bernardos-MacBook-Pro:2014_02_12_membrane_genes_PHOBIUS bernardo$ ./script.pl Type of arg 1 to keys must be hash or array (not hash element) at ./script.pl line 16, near "})" Execution of ./script.pl aborted due to compilation errors. – biotech Feb 12 '14 at 10:18
0

If your data comes from a mysql database (you can import it into one), you can use the group_concat operator.

See this answer Can I concatenate multiple MySQL rows into one field?

This currently labeled with 431 upvotes, so your question is a very common problem and the answer shows a very elegant solution.

Community
  • 1
  • 1
knb
  • 9,138
  • 4
  • 58
  • 85