merge multiple files with similar column

Question

I have 30 files where column 1 is similar in each file. I would like to join the files based on column 1 so that the output file contains column 2 from each of the input files. I know how to join two files, but I struggle with more than two.

join -1 1 -2 1 File1 File2

The files are tab-separated with no header, like this:

File1

5S_rRNA 1324
5_8S_rRNA   32
7SK 15
ACA59   0
ACA64   0
BC040587    0
CDKN2B-AS   0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1   0

File2

5S_rRNA 571
5_8S_rRNA   11
7SK 5
ACA59   0
ACA64   0
BC040587    0
CDKN2B-AS   0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1   0

Output

5S_rRNA 1324 571
5_8S_rRNA   32 11
7SK 15 5
ACA59   0 0
ACA64   0 0
BC040587    0 0
CDKN2B-AS   0 0
CDKN2B-AS_2 0 0
CDKN2B-AS_3 0 0
CLRN1-AS1   0 0

I have a solution here - but it does need a header row. You might need to 'fake one up' in order to get it to work. It looks for common (named) headers, and merges one or more CSV files based on it. http://stackoverflow.com/a/31245514/2566198 — Sobrique, Jul 10 '15 at 08:50

score 1 · Answer 1 · answered Jul 10 '15 at 08:56

Two caveats. First, memory becomes a problem as the files grow, because every value is kept in a hash. Second, the output order is arbitrary, so this only works well if the ordering of the content is not important.

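#!/usr/bin/perl
use strict;
use warnings;

my %hash;
my @files = glob 'files/*';    # every file in the files/ directory
foreach my $file (@files) {
    open my $fh, '<', $file or die "Unable to open $file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split ' ', $line;    # split on whitespace
        next unless defined $value;              # skip blank or one-column lines
        push @{ $hash{$key} }, $value;           # collect column 2 under its key
    }
    close $fh;
}
# keys %hash returns the keys in no particular order
for my $key (keys %hash) {
    print "$key @{ $hash{$key} }\n";
}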
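
If the order does matter, one option is to remember the order in which keys are first seen. This is only a minimal sketch, assuming the same two-column, whitespace-separated input; `merge_in_order` and `@order` are names made up for this illustration and are not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# merge_in_order(@files): hypothetical helper, not from the original answer.
# Returns one output line per key, in the order the keys are first seen.
sub merge_in_order {
    my @files = @_;
    my (%values, @order);
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!\n";
        while (my $line = <$fh>) {
            chomp $line;
            my ($key, $value) = split ' ', $line;    # split on whitespace
            next unless defined $value;              # skip blank or one-column lines
            push @order, $key unless exists $values{$key};   # record first sighting
            push @{ $values{$key} }, $value;
        }
        close $fh;
    }
    return map { join("\t", $_, @{ $values{$_} }) . "\n" } @order;
}

# Usage (assumed): perl merge.pl File1 File2 ... > merged.txt
print merge_in_order(@ARGV);
```

The extra array costs one more copy of each key, so the memory caveat still applies.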

score 0 · Answer 2 · serenesat · 2015-07-10T12:31:14.853

The code below gives your desired output, but its memory use grows with the number of files (you mentioned 30), because every value is kept in a hash of arrays. Sorting the hash keys prints the rows in alphabetical order of column 1, which is the same order as the output shown in the question.

#!/usr/bin/perl
use strict;
use warnings;

my @files = qw| input.log input1.log |; # put the file paths here, or use @ARGV to pass them on the command line
my %data;

foreach my $filename (@files)
{
    open my $fh, '<', $filename or die "Cannot open $filename for reading: $!";
    while (my $line = <$fh>)
    {
        chomp $line;
        my ($col1, $col2) = split /\s+/, $line;
        push @{ $data{$col1} }, $col2; # build a hash of arrays: key => [one value per file]
    }
}
foreach my $col1 (sort keys %data)
{
    print join("\t", $col1, @{ $data{$col1} }), "\n";    
}

Output:

5S_rRNA 1324    571
5_8S_rRNA   32  11
7SK 15  5
ACA59   0   0
ACA64   0   0
BC040587    0   0
CDKN2B-AS   0   0
CDKN2B-AS_2 0   0
CDKN2B-AS_3 0   0
CLRN1-AS1   0   0
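
One caveat with pushing values onto an array: if a key is missing from one of the files, the later values shift left and the columns no longer line up with the files. Below is a sketch of one way around that, assuming a placeholder for missing values is acceptable. The name `merge_padded` and the `NA` placeholder are hypothetical choices, not something from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# merge_padded(\@files, $placeholder): hypothetical helper.
# Stores each value in the slot of the file it came from, so a key that is
# missing from a file yields $placeholder instead of shifting the row.
sub merge_padded {
    my ($files, $placeholder) = @_;
    my %data;
    for my $i (0 .. $#$files) {
        open my $fh, '<', $files->[$i] or die "Cannot open $files->[$i]: $!\n";
        while (my $line = <$fh>) {
            chomp $line;
            my ($key, $value) = split ' ', $line;
            next unless defined $value;      # skip blank or one-column lines
            $data{$key}[$i] = $value;        # slot $i belongs to file $i
        }
        close $fh;
    }
    my @lines;
    for my $key (sort keys %data) {
        my @row = map { defined $data{$key}[$_] ? $data{$key}[$_] : $placeholder }
                  0 .. $#$files;
        push @lines, join("\t", $key, @row) . "\n";
    }
    return @lines;
}

# Usage (assumed): perl merge.pl File1 File2 ... > merged.txt
print merge_padded(\@ARGV, 'NA');
```

If a key appears twice in the same file, the last value wins in this sketch.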

answered Jul 10 '15 at 09:47 · edited Jul 10 '15 at 12:31 · serenesat

Can you add some explanations? – simbabque Jul 10 '15 at 10:42
Each line is clear. Can you please tell me what explanation I need to add for this small code? – serenesat Jul 10 '15 at 11:22
To me it's very clear. But I think the OP might appreciate if you explain why it makes sense to do it like this. :) – simbabque Jul 10 '15 at 11:45
@serenesat Nice script. I wonder if it's possible to get the filenames as headers in the output? – BioMan Jul 24 '15 at 11:29
@BioMan: Do you want to print all the filenames on the first line? – serenesat Jul 24 '15 at 11:40
Yes, the first row should be the filenames, in the same order as the input files. – BioMan Jul 24 '15 at 12:42
Add this line `print join("\t", @files), "\n";` before the first `foreach` loop. – serenesat Jul 24 '15 at 12:58
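
The header suggestion in the last comment can be sketched as a small helper. It makes one assumption beyond the comment: a leading label (`ID` here, an arbitrary choice) is added so the header has the same number of columns as the data rows. `basename` from the core File::Basename module strips any directory part, and `header_line` is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# header_line(@files): hypothetical helper that builds the header row.
# 'ID' labels the key column (an assumption; any label would do).
sub header_line {
    my @files = @_;
    return join("\t", 'ID', map { basename($_) } @files) . "\n";
}

# Print the header once, before the loop that prints the data rows.
print header_line(@ARGV) if @ARGV;
```

Without the leading label, the filenames would sit one column to the left of the values they describe.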