Matching Values in Hashes

Question

I have two arrays of hashes. I want to narrow down the second one according to variables in the first.

The first array contains hashes with keys seqname, source, feature, start, end, score, strand, frame, geneID and transcriptID.

The second array contains hashes with keys organism, geneID, number, motifnumber, position, strand and sequence.

What I want to do is remove from the first array all the hashes whose geneID value is not found in any of the hashes of the second array. (Note that both types of hash have the geneID key.) Simply put, I want to keep the hashes in the first array whose geneID values appear in the hashes of the second array.

My attempt at this so far was with two loops:

my @subset; # Define a new array for the wanted hashes to go into.

for my $i (0 .. $#first_hash_array){  # Begin loop to go through the hashes of the first array.

    for my $j (0 .. $#second_hash_array){ # Begin loop through the hashes of the 2nd array.

        if ($second_hash_array[$j]{geneID} =~ m/$first_hash_array[$i]{geneID}/)
        {
           push @subset, $second_hash_array[$j];
        }

    }

}

However I'm not sure that this is the right way to go about this.

ikegami · Accepted Answer · 2013-04-16T01:02:35.177

For starters, $a =~ /$b/ doesn't check for equality. You'd need

$second_hash_array[$j]{geneID} =~ m/^\Q$first_hash_array[$i]{geneID}\E\z/

or simply

$second_hash_array[$j]{geneID} eq $first_hash_array[$i]{geneID}

for that.
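To make the difference concrete, here is a minimal sketch comparing the three tests. The two IDs are made up for illustration; they are not from the question's data.

```perl
use strict;
use warnings;

# Hypothetical IDs chosen so that one is a prefix of the other.
my $short = 'ID_002';
my $long  = 'ID_0021';

# Unanchored match: true, because 'ID_002' occurs inside 'ID_0021'.
my $regex_match = $long =~ /$short/ ? 1 : 0;

# Anchored and quoted match: behaves like string equality.
my $anchored_match = $long =~ /^\Q$short\E\z/ ? 1 : 0;

# Plain string equality.
my $eq_match = $long eq $short ? 1 : 0;

# Prints "unanchored: 1, anchored: 0, eq: 0"
print "unanchored: $regex_match, anchored: $anchored_match, eq: $eq_match\n";
```

Without \Q...\E, any regex metacharacters in the interpolated ID (such as . or +) would also be treated as pattern syntax rather than literal text.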

Secondly,

for my $i (0 .. $#first_hash_array) {
   ... $first_hash_array[$i] ...
}

can be written more succinctly as

for my $first (@first_hash_array) {
   ... $first ...
}

Next on the list is that

for my $second (@second_hash_array) {
    if (...) {
       push @subset, $second;
    }
}

can add $second to @subset more than once. You either need to add a last

# Perform the push if the condition is true for any element.
for my $second (@second_hash_array) {
   if (...) {
      push @subset, $second;
      last;
   }
}

or move the push out of the loop

# Perform the push if the condition is true for all elements.
my $flag = 1;
for my $second (@second_hash_array) {
   if (!...) {
      $flag = 0;
      last;
   }
}

if ($flag) {
   push @subset, $first;   # $second is out of scope after the loop
}

depending on what you want to do.
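As an alternative to hand-written flags, the same "any element" and "all elements" tests can be sketched with any and all from List::Util (available in List::Util 1.33 and later). This is a swapped-in technique, not part of the code above, and the sample data and the names @matches_any and @matches_all are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(any all);   # any/all need List::Util 1.33 or newer

# Hypothetical sample data.
my @first_hash_array  = ( { geneID => 'ID_001' }, { geneID => 'ID_002' } );
my @second_hash_array = ( { geneID => 'ID_002' }, { geneID => 'ID_003' } );

my (@matches_any, @matches_all);
for my $first (@first_hash_array) {
    # True if at least one element of the second array has the same geneID.
    push @matches_any, $first
        if any { $_->{geneID} eq $first->{geneID} } @second_hash_array;

    # True only if every element of the second array has the same geneID.
    push @matches_all, $first
        if all { $_->{geneID} eq $first->{geneID} } @second_hash_array;
}

# Prints "1 0": only ID_002 matches some element, and nothing matches all.
print scalar(@matches_any), " ", scalar(@matches_all), "\n";
```

Both functions stop as soon as the result is known, which corresponds to the last in the loops above.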

To remove elements from an array in place, one would use splice. But removing an element shifts the indexes of all later elements, so it's better to iterate over the array backwards (from the last index to the first).

Not only is that complicated, it's also expensive: every time you splice, all subsequent elements in the array need to be moved.
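For reference, a minimal sketch of the splice approach, assuming a lookup hash %geneIDs_to_keep of wanted IDs has already been built. The sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @first_hash_array = (
    { geneID => 'ID_001' },
    { geneID => 'ID_002' },
    { geneID => 'ID_003' },
    { geneID => 'ID_004' },
);
my %geneIDs_to_keep = ( ID_002 => 1, ID_004 => 1 );

# Walk from the last index down to 0 so that removing an element
# never shifts the position of an element not yet visited.
for my $i (reverse 0 .. $#first_hash_array) {
    splice(@first_hash_array, $i, 1)
        unless $geneIDs_to_keep{ $first_hash_array[$i]{geneID} };
}

# Prints "ID_002,ID_004"
print join(",", map { $_->{geneID} } @first_hash_array), "\n";
```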

A better approach is to filter the elements and assign the resulting list back to the array.

my @new_first_hash_array;
for my $first (@first_hash_array) {
   my $found = 0;
   for my $second (@second_hash_array) {
      if ($first->{geneID} eq $second->{geneID}) {
         $found = 1;
         last;
      }
   }

   if ($found) {
      push @new_first_hash_array, $first;
   }
}

@first_hash_array = @new_first_hash_array;

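Iterating through @second_hash_array repeatedly is needlessly expensive. Instead, build a lookup hash of its geneIDs once and test each element of the first array against it.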

my %geneIDs_to_keep;
for (@second_hash_array) {
   ++$geneIDs_to_keep{ $_->{geneID} };
}

my @new_first_hash_array;
for (@first_hash_array) {
   if ($geneIDs_to_keep{ $_->{geneID} }) {
      push @new_first_hash_array, $_;
   }
}

@first_hash_array = @new_first_hash_array;

Finally, we can replace that for with a grep to give the following simple and efficient answer:

my %geneIDs_to_keep;
++$geneIDs_to_keep{ $_->{geneID} } for @second_hash_array;

@first_hash_array = grep $geneIDs_to_keep{ $_->{geneID} }, @first_hash_array;
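To try this end to end, here is a small self-contained sketch. The sample records are made up and include only a couple of the keys described in the question.

```perl
use strict;
use warnings;

# Hypothetical sample data; only the keys needed for the example are filled in.
my @first_hash_array = (
    { seqname => 'chr1', geneID => 'ID_001' },
    { seqname => 'chr1', geneID => 'ID_002' },
    { seqname => 'chr2', geneID => 'ID_003' },
);
my @second_hash_array = (
    { organism => 'example', geneID => 'ID_002' },
    { organism => 'example', geneID => 'ID_003' },
    { organism => 'example', geneID => 'ID_003' },
);

# Count how often each geneID occurs in the second array.
my %geneIDs_to_keep;
++$geneIDs_to_keep{ $_->{geneID} } for @second_hash_array;

# Keep only the hashes whose geneID has a non-zero count.
@first_hash_array = grep $geneIDs_to_keep{ $_->{geneID} }, @first_hash_array;

# Prints ID_002 and ID_003, one per line.
print "$_->{geneID}\n" for @first_hash_array;
```

Duplicate geneIDs in the second array only raise the count in %geneIDs_to_keep; each hash of the first array is still kept at most once.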

Thanks for replying, I can't be sure but I think actually this deletes what I want and keeps what I want to delete. If I wanted to keep only the hashes in first_hash_array with geneID's which match those from the other second_hash_array shouldn't it be something like: `my %geneIDs_to_keep; ++$geneIDs_to_keep{ $_->{geneID}}for @second_hash_array;` To get the IDs I want to keep, and then something like `my @new_array = grep $geneIDs_to_keep{ $_->{geneID} }, @first_hash_array;` ? — SJWard, Apr 15 '13 at 19:29
In addition if you have time, can you expand on the final code chunk some more? To my newbie sense of understanding I can see that the first two lines create a hash with all the geneID's to be deleted/kept, by going through every hash in the array and getting the geneID from every hash, using a loop and the default variable. The final line to me is the more difficult to understand. I look at the grep page here: http://perldoc.perl.org/functions/grep.html. Given the simple example `@foo = grep {!/^#/} @bar;` it's the ` !$geneIDs_to_delete{ $_->{geneID} }` I'm finding hard to interpret. — SJWard, Apr 15 '13 at 19:45
Reading some more, does the final line iterate through, `@first_hash_array`, sets the `$_` so the `$_->{geneID}` part for example, becomes 'ID_002' if that was the `geneID` of the element of `first_hash_array`, and then the rest becomes `grep !$geneIDs_to_delete{ID_002}` which tests to see if ID_002 is/is-not in the list of genes to delete? — SJWard, Apr 15 '13 at 20:03
The last chunk is exactly equivalent to the second last chunk. `!$geneIDs_to_delete{ $_->{geneID} }` (or the updated `$geneIDs_to_keep{ $_->{geneID} }`) is executed for every element of the list passed to `grep` to determine whether to keep (if true) or discard (if false) in the returned list. Yes, your understanding is correct. — ikegami, Apr 16 '13 at 01:04

score 1 · Answer 2 · edited May 23 '17 at 12:04

This is how I would do it.

Create an array req_geneID for the required geneIDs and put all the geneIDs from the second array of hashes into it.

Traverse the first array of hashes and check whether each geneID is contained in the req_geneID array. (It's easy in Ruby using "include?", but you can try this in Perl.)

Finally, delete the hashes that do not match any geneID in req_geneID, using something like this in Perl:

for (keys %hash)
{
    delete $hash{$_};
}
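One possible Perl rendering of the steps described above, as a rough sketch: it uses grep in boolean context in place of Ruby's include?, and it builds a new array @kept instead of deleting in place. The sample data and @kept are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @first_hash_array  = ( { geneID => 'ID_001' }, { geneID => 'ID_002' } );
my @second_hash_array = ( { geneID => 'ID_002' } );

# Step 1: collect the required geneIDs from the second array.
my @req_geneID = map { $_->{geneID} } @second_hash_array;

# Steps 2 and 3: keep a hash only if its geneID appears in @req_geneID.
# In boolean context grep yields the number of matches, so zero means
# "not included".
my @kept;
for my $first (@first_hash_array) {
    my $id = $first->{geneID};
    push @kept, $first if grep { $_ eq $id } @req_geneID;
}

# Prints "ID_002"
print join(",", map { $_->{geneID} } @kept), "\n";
```

This scans @req_geneID once for every element of the first array, so for large inputs a hash lookup is the cheaper choice.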

Hope this helps.. :)
