I have 30 files in which column 1 is similar in each file. I would like to join the files on column 1 so that the output file contains column 2 from each of the input files. I know how to join two files, but I am struggling with more than two.

join -1 1 -2 1 File1 File2

The files are tab-separated with no header, like this:

File1

5S_rRNA 1324
5_8S_rRNA   32
7SK 15
ACA59   0
ACA64   0
BC040587    0
CDKN2B-AS   0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1   0

File2

5S_rRNA 571
5_8S_rRNA   11
7SK 5
ACA59   0
ACA64   0
BC040587    0
CDKN2B-AS   0
CDKN2B-AS_2 0
CDKN2B-AS_3 0
CLRN1-AS1   0

Output

5S_rRNA 1324 571
5_8S_rRNA   32 11
7SK 15 5
ACA59   0 0 
ACA64   0 0
BC040587    0 0
CDKN2B-AS   0 0
CDKN2B-AS_2 0 0
CDKN2B-AS_3 0 0
CLRN1-AS1   0 0
Borodin
BioMan
  • Better post your code also with expected output. – serenesat Jul 10 '15 at 08:24
  • Is the ordering of the values important? – Arunesh Singh Jul 10 '15 at 08:49
  • I have a solution here - but it does need a header row. You might need to 'fake one up' in order to get it to work. It looks for common (named) headers, and merges one or more CSV files based on it. http://stackoverflow.com/a/31245514/2566198 – Sobrique Jul 10 '15 at 08:50

2 Answers

First, memory becomes a problem as the file sizes increase, because every value is held in a hash until the end. Second, if the order of the output lines is not important, this will work well.

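#!/usr/bin/perl
use strict;
use warnings;

my %hash;
my @files = glob 'files/*';    # every file in the files/ directory

foreach my $file (@files) {
    open my $fh, '<', $file or die "Unable to open $file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;               # skip blank lines
        my ($key, $value) = split ' ', $line;    # split on any whitespace
        push @{ $hash{$key} }, $value;           # collect column 2 for each key
    }
    close $fh;
}

# keys() returns the hash keys in no particular order
for my $key (keys %hash) {
    print "$key @{ $hash{$key} }\n";
}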
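
If memory is the main worry, the same merge can also be sketched with `join` itself, folding it over the files so that only two inputs are in play at a time. This is only a sketch under some assumptions: the function name `merge_by_key` is made up for illustration, the files are tab-separated with unique keys in column 1, `mktemp -d` is available, and a key that is missing from any one file is dropped from the result (plain `join` behaviour).

```shell
#!/bin/sh
# Hypothetical helper: fold join(1) over every regular file in a directory.
# Assumes tab-separated input with unique keys in column 1.
merge_by_key() {
    dir=$1
    tmp=$(mktemp -d) || return 1
    tab=$(printf '\t')
    first=1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # join needs both inputs sorted on the join field, in the same locale
        LC_ALL=C sort -t "$tab" -k1,1 "$f" > "$tmp/next"
        if [ "$first" -eq 1 ]; then
            mv "$tmp/next" "$tmp/acc"      # first file seeds the accumulator
            first=0
        else
            LC_ALL=C join -t "$tab" "$tmp/acc" "$tmp/next" > "$tmp/out"
            mv "$tmp/out" "$tmp/acc"       # accumulator gains one column
        fi
    done
    [ -f "$tmp/acc" ] && cat "$tmp/acc"
    rm -rf "$tmp"
}
```

It could be called as `merge_by_key files > merged.tsv`. The output comes out sorted by key in byte order, since that is the order `join` requires of its inputs.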
Arunesh Singh

The code below gives the desired output, but it uses more memory as the number of files grows (you mentioned 30 files), because every value is kept in a hash until the end. Sorting the keys prints the rows in alphabetical order, which is the same order as the output shown in the question.

#!/usr/bin/perl
use strict;
use warnings;

my @files = qw| input.log input1.log |;  # list the file paths here, or use @ARGV to pass the files on the command line
my %data;

foreach my $filename (@files)
{
    open my $fh, '<', $filename or die "Cannot open $filename for reading: $!";
    while (my $line = <$fh>)
    {
        chomp $line;
        my ($col1, $col2) = split /\s+/, $line;
        push @{ $data{$col1} }, $col2;  # build a hash of arrays: key => column 2 from each file
    }
}
foreach my $col1 (sort keys %data)
{
    print join("\t", $col1, @{ $data{$col1} }), "\n";    
}

Output:

5S_rRNA 1324    571
5_8S_rRNA   32  11
7SK 15  5
ACA59   0   0
ACA64   0   0
BC040587    0   0
CDKN2B-AS   0   0
CDKN2B-AS_2 0   0
CDKN2B-AS_3 0   0
CLRN1-AS1   0   0
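
One thing to watch with any "push onto a list per key" approach: if a key is missing from one of the files, the later values shift left and the columns no longer line up with the files. A rough way around that is to record which file each value came from and fill the gaps. The sketch below does this with awk; the function name `merge_fill` and the `NA` placeholder are assumptions for illustration, and like the Perl versions it keeps everything in memory.

```shell
#!/bin/sh
# Hypothetical helper: merge column 2 of several tab-separated files by key,
# writing NA where a key does not occur in a given file.
# Note: an empty input file is not counted as a column.
merge_fill() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { nfiles++ }                    # first line of a new file
        { val[$1, nfiles] = $2; seen[$1] = 1 }   # remember value per key and file
        END {
            for (k in seen) {
                line = k
                for (i = 1; i <= nfiles; i++) {
                    v = ((k, i) in val) ? val[k, i] : "NA"
                    line = line OFS v
                }
                print line
            }
        }
    ' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k1,1
}
```

Usage would be along the lines of `merge_fill File1 File2 > merged.tsv`; the final `sort` is there because awk's `for (k in seen)` visits keys in an unspecified order.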
serenesat