Collapse rows based on identical keys

Question

I would like to collapse rows based on equality of first column. Then add the content of second column to the new collapsed table, comma-separated and with additional space. Also, if content of second column are the same, collapse them, that is, if 'non-virulent' appears two times in output file, show it just once.

I'm quite new here, please explain how to run it. Hope anyone can help me!

Input (tab-delimited):

HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

Desired output (tab-delimited):

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

Why do some of your output lines (the last 1) have spaces after commas and others (the first 2) don't? — Ed Morton, Feb 12 '14 at 12:45

mpapec · Accepted Answer · 2014-02-12T11:38:38.207

2

Perl from command line,

perl -lane'
  ($n, $p) =@F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",\t";
    print "$_\t@{$h{$_}}" for @r;
  }
' file

output

HS372_01446     non-virulent,   lung
HS372_00498     non-virulent,   lung
HS372_00954     jointlungCNS,   non-virulent,   moderadamentevirulenta(nose),  lung

edited Feb 12 '14 at 11:38

answered Feb 12 '14 at 10:26

mpapec

50,217
8
67
127

2

That's the most lines I've ever seen in a one-liner! – Borodin Feb 12 '14 at 11:34

Chris · Answer 2 · 2014-02-12T10:40:44.010

2

Another Perl solution:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/uniq/;

my %hash;
while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}

while ( my ( $key, $values ) = each %hash )
{
    print "$key\t", join ', ', uniq @$values, "\n";  
}

__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

edited Feb 12 '14 at 10:40

answered Feb 12 '14 at 10:27

Chris

729
6
12

Dreadful variable names: `%hash` is as useful as `$scalar`. `List::MoreUtils` isn't a core module and may need installing. No point in the `chomp` as the `split` will ignore any whitespace. `"\t"` is rarely useful as it aligns output only for the most minimal set of data. But +1 because this is close to the best solution for an unsorted output. – Borodin Feb 12 '14 at 16:33

score 2 · Answer 3 · answered Feb 12 '14 at 11:29

This does what you ask, and in addition keeps the IDs and descriptions in the same order that they appear in the file, in case that matters:

use strict;
use warnings;

open my $fh, '<', 'diseases.txt';

my %diseases;
my @ids;

while (<$fh>) {
  my ($id, $desc) = split;
  if (not $diseases{$id}) {
    $diseases{$id}{list} = [$desc];
    $diseases{$id}{seen}{$desc} = 1;
    push @ids, $id;
  }
  elsif (not $diseases{$id}{seen}{$desc}) {
    push @{ $diseases{$id}{list} }, $desc;
    $diseases{$id}{seen}{$desc} = 1;
  }
}

for my $id (@ids) {
  printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} };
}

output

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

Jakob Bowyer · Answer 4 · 2014-02-12T10:10:27.023

1

from collections import defaultdict

a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".split("\n")

stuff = defaultdict(set)

for line in a:
    uid, symp = line.split(" ")
    stuff[uid].add(symp)

for uid, symps in stuff.iteritems():
    print "%s %s" % (uid, ", ".join(list(symps)))

edited Feb 12 '14 at 10:10

answered Feb 12 '14 at 10:05

Jakob Bowyer

33,878
8
76
91

Traceback (most recent call last): File "script.py", line 22, in uid, symp = line.split(" ") ValueError: need more than 1 value to unpack – biotech Feb 12 '14 at 10:31

gabor · Answer 5 · 2014-02-12T10:55:31.717

Java:

javac Collapse.java

java Collapse input.txt

import java.io.*;
import java.util.*;

public class Collapse {

    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));

        Map<String, Set<String>> output = new HashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, "\t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }

        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "\t" + sb);
        }
    }
}

Ed Morton · Answer 6 · 2014-02-12T20:07:57.300

1

The standard UNIX tool for parsing text files is awk:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file
HS372_00498     non-virulent, lung
HS372_00954     jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446     non-virulent, lung

edited Feb 12 '14 at 20:07

answered Feb 12 '14 at 12:22

Ed Morton

188,023
17
78
185

fugu · Answer 7 · 2014-02-12T10:45:09.097

0

In perl:

use warnings;
use strict; 

open my $input, '<', 'in.txt';

my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    $hash{$split[0]}{$split[1]} = 1;
}

for my $key (keys %hash){
    print "$key\t";
        for my $info (keys $hash{$key}){
            print "$info\t";
        }
    print "\n";
}

Which prints:

HS372_01446 non-virulent    lung    
HS372_00954 non-virulent    moderadamentevirulenta(nose)    jointlungCNS    lung    
HS372_00498 non-virulent    lung

edited Feb 12 '14 at 10:45

answered Feb 12 '14 at 10:14

fugu

6,417
5
40
75

Bernardos-MacBook-Pro:2014_02_12_membrane_genes_PHOBIUS bernardo$ ./script.pl Type of arg 1 to keys must be hash or array (not hash element) at ./script.pl line 16, near "})" Execution of ./script.pl aborted due to compilation errors. – biotech Feb 12 '14 at 10:18

score 0 · Answer 8 · edited May 23 '17 at 12:05

0

If your data comes from a mysql database (you can import it into one), you can use the group_concat operator.

See this answer Can I concatenate multiple MySQL rows into one field?

This currently labeled with 431 upvotes, so your question is a very common problem and the answer shows a very elegant solution.

edited May 23 '17 at 12:05

Community

1
1

answered Feb 12 '14 at 10:39

knb

9,138
4
58
85

Collapse rows based on identical keys

8 Answers8