0

I have a file with multiple strings; say data.fa.

sp|P08246|ELNE_HUMAN Neutrophil elastase OS=Homo sapiens GN=ELANE PE=1 SV=1
MTLGRRLACLFLACVLPALLLGGTALASEIVGGRRARPHAWPFMVSLQLRGGHFCGATLI
APNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVI
LQLNGSATINANVQVAQLPAQGRRLGNGVQCLAMGWGLLGRNRGIASVLQELNVTVVTSL
CRRSNVCTLVRGRQAGVCFGDSGSPLVCNGLIHGIASFVRGGCASGLYPDAFAPVAQFVN
WIDSIIQRSEDNPCPHPRDPDPASRTHGGGGNGVQCLAMGWG
sp|P31689|DNJA1_HUMAN DnaJ homolog subfamily A member 1 OS=Homo sapiens GN=DNAJA1 PE=1 SV=2
MVKETTYYDVLGVKPNATQEELKKAYRKLALKYHPDKNPNEGEKFKQISQAYEVLSDAKK
RELYDKGGEQAIKEGGAGGGFGSPMDIFDMFFGGGGRMQRERRGKNVVHQLSVTLEDLYN
GATRKLALQKNVICDKCEGRGGKKGAVECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMEC
QGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDII
sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVL
TAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKR
TRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCESDLRHY
YDSTIELCVGDPEIKKTSFKGDSGGPLVCNKVAQGIVSYGRNNGMPPRACTKVSSFVHWI
KKTMKRYGNGVQCLAMGWG

I am trying to print the header and the no of motifs (GNGVQCLAMGWG) if any on an output file. Yeah ! it's a newbie here. I have the following code

   #!/usr/bin/perl

use strict;
use warnings;

print STDOUT "Enter the motif: ";
my $motif = <STDIN>;
chomp $motif;

my %seqs = %{ read_fasta_as_hash( 'data.fa' ) };
foreach my $id ( keys %seqs ) {
    if ( $seqs{$id} =~ /$motif/ ) {
        print $id, "\n";
        print $seqs{$id}, "\n";
    }
}

sub read_fasta_as_hash {
    my $fn = shift;

    my $current_id = '';
    my %seqs;
    open FILE, "<$fn" or die $!;
    while ( my $line = <FILE> ) {
        chomp $line;
        if ( $line =~ /^(>.*)$/ ) {
            $current_id  = $1;
        } elsif ( $line !~ /^\s*$/ ) { # skip blank lines
            $seqs{$current_id} .= $line
        }
    }
    close FILE or die $!;

    return \%seqs;
}

I am expecting output like:

sp|P08246|ELNE_HUMAN Neutrophil elastase OS=Homo sapiens GN=ELANE PE=1 SV=1: 02
sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens GN=GZMB PE=1 SV=2: 01

I need help.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156

1 Answers1

0

So, here:

if ( $seqs{$id} =~ /$motif/ ) {
    print $id, "\n";
    print $seqs{$id}, "\n";
}

You're close to the mark, but doing unnecessary work. If all we need is the number of matches, we can do simply count and format the result to get your leading zero:

my $matches = () = $seqs{$id} =~ /$motif/g;
if ($matches > 0) {
    my $matches_string = sprintf('%02d', $matches);
    print "$id: $matches_string\n";
}

As an aside, if there's any chance that $motif will ever contain regex metacharacters (and you might as well assume it could), you can escape it:

#not escaped
/$motif/g

#escaped
/\Q$motif\E/g

Finally, do you need the results in any particular order? The keys operator doesn't guarantee that you'll get keys out in the same order they're inserted.

Community
  • 1
  • 1
rutter
  • 11,242
  • 1
  • 30
  • 46
  • Not really, but the most preferred part is the output order i.e, – user3489854 Apr 09 '14 at 20:51
  • Not really, but the most preferred part is the output order i.e, >FASTA_header1: number of motifs – user3489854 Apr 09 '14 at 20:52
  • Some of the functions are real hard to fit in my brain. Could you please type in the actual script for me to run. Thanks Rutter. – user3489854 Apr 09 '14 at 20:58
  • As an output, in addition to the "Fasta header:count of matches"; is it possible to get the length of each string as well? – user3489854 Apr 10 '14 at 14:06
  • @user3489854 Perl uses [`length`](http://perldoc.perl.org/functions/length.html) to get a string's length. Be careful about whether or not you want to count newlines and so forth, though. – rutter Apr 10 '14 at 20:28