0

The Perl code below works, but it doesn't scale well even with considerable computer resources. I hoping that someone can help me find more efficient code such as by replacing recursion with iteration, if that's the problem.

my data structure looks like this: my %REV_ALIGN;
$REV_ALIGN{$dna}{$rna} = ();

Any dna key may have multiple rna sub keys. The same rna sub key may appear with multiple different dna keys. The purpose is to group rna ( transcripts ) based on shared dna sequence elements. For example, if dnaA has RNA1, RNA8, RNA9, and RNA4, and dnaB has RNA11, RNA4, and RNA99, then we group all these transcripts together ( RNA1, RNA9, RNA4, RNA11, RNA99 ) and continue to proceed to try and add to the group by selecting other dna. My recusive solution to this problem works but doesn't scale so well when using data from whole genome to transcriptome alignment.

SO MY QUESTION IS: WHAT IS A MORE EFFICIENT SOLUTION TO THIS PROBLEM? THANK YOU VERY MUCH

my @groups;

while ( my $x =()= keys %REV_ALIGN )
{
    my @DNA = keys %REV_ALIGN;
    my $dna = shift @DNA;
    # the corresponding list of rna
    my @RNA = keys %{$REV_ALIGN{$dna}};
    delete $REV_ALIGN{$dna};

    if ( $x == 1 )
    {
        push @groups, \@RNA;
        last;
    }

    my $ref = group_transcripts ( \@RNA, \%REV_ALIGN );
    push @groups, $ref;
}


sub group_transcripts
{
    my $tran_ref = shift;
    my $align_ref = shift;
    my @RNA_A = @$tran_ref;
    my %RNA;
    # create a null hash with seed list of transcripts
    @RNA{@RNA_A} = ();
    # get a list of all remaining dna sequences in the alignment
    my @DNA = keys %{$align_ref};
    my %count;
    # select a different list of transcripts
    for my $dna ( @DNA )
    {
        next unless exists $align_ref->{$dna};
        my @RNA_B = keys %{$align_ref->{$dna}};
        # check to see two list share and transcripts
        for my $element ( @RNA_A, @RNA_B )
        {
            $count{$element}++;
        }
        for my $rna_a ( keys %count )
        {
            # if they do, add any new transcripts to the current group
            if ( $count{$rna_a} == 2 )
            {
                for my $rna_b ( @RNA_B )
                {
                    push @RNA_A, $rna_b if $count{$rna_b} == 1;
                }
                delete $align_ref->{$dna};
                delete $count{$_} foreach keys %count;
                # recurse to try and continue adding to list
                @_ = ( \@RNA_A, $align_ref );
                goto &group_transcripts;
            }
        }
        delete $count{$_} foreach keys %count;
    }
   # if no more transcripts can be added, return a reference to the group
    return \@RNA_A;
}
robertb
  • 21
  • 4
  • You may want to profile your code http://stackoverflow.com/questions/4371714/how-do-i-profile-my-perl-programs Btw, `delete $count{$_} foreach keys %count;` can be replaced with `%count = ();` Btw2, don't copy arrays around if not necessary, ie. `my @RNA_A = @$tran_ref;` – mpapec Oct 06 '15 at 17:25
  • `my $x =()= keys %REV_ALIGN` => `my $x = keys %REV_ALIGN` – mpapec Oct 06 '15 at 17:40
  • thanks for some suggestions everyone. – robertb Oct 06 '15 at 17:41
  • is what correct? list context returns a list but I don't think scalar context would work by itself – robertb Oct 06 '15 at 19:22

1 Answers1

1

You have a loops nested four deep. It's an pretty safe bet that's why your code scales poorly.

If I understand correctly what you are trying to accomplish, the input

my %REV_ALIGN = (
   "DNA1" => { map { $_ => undef } "RNA1", "RNA2" }, # \ Linked by RNA1     \
   "DNA2" => { map { $_ => undef } "RNA1", "RNA3" }, # /  \ Linked by RNA3   > Group
   "DNA3" => { map { $_ => undef } "RNA3", "RNA4" }, #    /                 /

   "DNA4" => { map { $_ => undef } "RNA5", "RNA6" }, # \ Linked by RNA5     \  Group
   "DNA5" => { map { $_ => undef } "RNA5", "RNA7" }, # /                    /

   "DNA6" => { map { $_ => undef } "RNA8" },         #                      >  Group
);

should result in

my @groups = (
   [
      dna => [ "DNA1", "DNA2", "DNA3" ],
      rna => [ "RNA1", "RNA2", "RNA3", "RNA4" ],
   ],
   [
      dna => [ "DNA4", "DNA5" ],
      rna => [ "RNA5", "RNA6", "RNA7" ],
   ],
   [
      dna => [ "DNA6" ],
      rna => [ "RNA8" ],
   ],
);

If so, you can use the following:

use strict;
use warnings;

use Graph::Undirected qw( );

my %REV_ALIGN = (
   "DNA1" => { map { $_ => undef } "RNA1", "RNA2" },
   "DNA2" => { map { $_ => undef } "RNA1", "RNA3" },
   "DNA3" => { map { $_ => undef } "RNA3", "RNA4" },
   "DNA4" => { map { $_ => undef } "RNA5", "RNA6" },
   "DNA5" => { map { $_ => undef } "RNA5", "RNA7" },
   "DNA6" => { map { $_ => undef } "RNA8" },
);

my $g = Graph::Undirected->new();
for my $dna (keys(%REV_ALIGN)) {
   for my $rna (keys(%{ $REV_ALIGN{$dna} })) {
      $g->add_edge("dna:$dna", "rna:$rna");
   }
}

my @groups;
for my $raw_group ($g->connected_components()) {
   my %group = ( dna => [], rna => [] );
   for (@$raw_group) {
      my ($type, $val) = split(/:/, $_, 2);
      push @{ $group{$type} }, $val;
   }

   push @groups, \%group;
}

use Data::Dumper qw( Dumper );
print(Dumper(\@groups));

If you just want the RNA, the final section simplifies to the following:

my @groups;
for my $raw_group ($g->connected_components()) {
   my @group;
   for (@$raw_group) {
      my ($type, $val) = split(/:/, $_, 2);
      push @group, $val if $type eq 'rna';
   }

   push @groups, \@group;
}
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • Thanks for taking the time to respond with this. I'm not familiar with graphs. Having the dna as well isn't required but I'll see how this works and get back to you – robertb Oct 06 '15 at 20:05
  • Added RNA-only version to bottom of answer. – ikegami Oct 06 '15 at 20:09