The Perl code below works, but it doesn't scale well even with considerable computer resources. I hoping that someone can help me find more efficient code such as by replacing recursion with iteration, if that's the problem.
my data structure looks like this:
my %REV_ALIGN;
$REV_ALIGN{$dna}{$rna} = ();
Any dna key may have multiple rna sub keys. The same rna sub key may appear with multiple different dna keys. The purpose is to group rna ( transcripts ) based on shared dna sequence elements. For example, if dnaA has RNA1, RNA8, RNA9, and RNA4, and dnaB has RNA11, RNA4, and RNA99, then we group all these transcripts together ( RNA1, RNA9, RNA4, RNA11, RNA99 ) and continue to proceed to try and add to the group by selecting other dna. My recusive solution to this problem works but doesn't scale so well when using data from whole genome to transcriptome alignment.
SO MY QUESTION IS: WHAT IS A MORE EFFICIENT SOLUTION TO THIS PROBLEM? THANK YOU VERY MUCH
my @groups;
while ( my $x =()= keys %REV_ALIGN )
{
my @DNA = keys %REV_ALIGN;
my $dna = shift @DNA;
# the corresponding list of rna
my @RNA = keys %{$REV_ALIGN{$dna}};
delete $REV_ALIGN{$dna};
if ( $x == 1 )
{
push @groups, \@RNA;
last;
}
my $ref = group_transcripts ( \@RNA, \%REV_ALIGN );
push @groups, $ref;
}
sub group_transcripts
{
my $tran_ref = shift;
my $align_ref = shift;
my @RNA_A = @$tran_ref;
my %RNA;
# create a null hash with seed list of transcripts
@RNA{@RNA_A} = ();
# get a list of all remaining dna sequences in the alignment
my @DNA = keys %{$align_ref};
my %count;
# select a different list of transcripts
for my $dna ( @DNA )
{
next unless exists $align_ref->{$dna};
my @RNA_B = keys %{$align_ref->{$dna}};
# check to see two list share and transcripts
for my $element ( @RNA_A, @RNA_B )
{
$count{$element}++;
}
for my $rna_a ( keys %count )
{
# if they do, add any new transcripts to the current group
if ( $count{$rna_a} == 2 )
{
for my $rna_b ( @RNA_B )
{
push @RNA_A, $rna_b if $count{$rna_b} == 1;
}
delete $align_ref->{$dna};
delete $count{$_} foreach keys %count;
# recurse to try and continue adding to list
@_ = ( \@RNA_A, $align_ref );
goto &group_transcripts;
}
}
delete $count{$_} foreach keys %count;
}
# if no more transcripts can be added, return a reference to the group
return \@RNA_A;
}