Finding imperfect and perfect patterns within string

Question

I am working on a Perl script that searches a string of nucleotides for repeated patterns. So far, I've been able to use the following regexes:

    my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
    my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
    my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;
    for my $regex ($regex1, $regex2, $regex3) {
        next unless $seq1 =~ $regex;
        printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
        printf "Length of sequence: $number \n";
    }

How would I go about doing the following?

- Find perfect repeats (repeated with no interruption) and imperfect repeats (repeated, but the run of repeats can be broken by a nucleotide), with a minimum of 10 repeats required.

- Print the entire matched sequence.

SAMPLE INPUT - GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA

Current script in its entirety

print "Di-, Tri-, Tetra-nucleotide Tandem Repeat Finder v1.0 \n\n";
print "Please specify the file location (DO NOT DRAG/DROP files!) then press ENTER:\n";
$seq = <STDIN>;

#Remove the newline from the filename
chomp $seq;

#open the file or exit
open (SEQFILE, $seq) or die "Can't open '$seq': $!";

#read the dna sequence from the file and store it into the array variable @seq1
@seq1 = <SEQFILE>;

#Close the file
close SEQFILE;

#Put the sequence into a single string as it is easier to search for the motif
$seq1 = join( '', @seq1);

#Remove whitespace
$seq1 =~s/\s//g;

#Count of number of nucleotides
#Initialize the variable
$number = 0;
$number = length $seq1;
#Use regexes to say "find 3 nucleotides, then match them at least 6 more times"
# qr (quotes and compiles) /( ([nucs]{number of nucs in pattern}) \2{number of additional repeats,} )/x (x permits whitespace within the pattern)

my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;

#Tell program to use $regex on variable that holds the file
for my $regex ($regex1, $regex2, $regex3) {
    next unless $seq1 =~ $regex;
    printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
    printf "Length of sequence: $number \n";
}

exit;

Perhaps you should include some sample input/output and test cases. — TLP, Feb 23 '13 at 14:21
And what is the output you want with this sample input? You have to realize that not everyone is familiar with biology terminology and DNA jargon. — TLP, Feb 23 '13 at 14:40
You're right, sorry. I need the output to tell me which two nucleotides form the repeating element, how many times the repeat was found, and the entire sequence (from where the repeat begins to where it ends). — Citizin, Feb 23 '13 at 14:43

score 0 · Answer 1 · edited May 23 '17 at 11:43

I'm not sure I fully understand what you need, but perhaps this will give you an idea:

use strict;    # You should be using this,
use warnings;  # and this.

my $input = 'GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA';

my $patt      = '[ACGT]{2}';   # Some pattern of interest.
my $intervene = '[ACGT]*';     # Some intervening pattern.
my $m         = 7 - 2;         # Minimum N of times to find pattern, less 2.

my $rgx = qr/( 
    ($patt) $intervene
    (\2     $intervene ){$m,}
    \2
)/x;

print $1, "\n" if $input =~ $rgx;
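If an interruption is meant to be a single nucleotide rather than an arbitrary stretch, the same idea can be tightened by allowing at most one stray base between consecutive copies of the motif. The following is only a minimal sketch under that assumption: the helper name `find_imperfect_repeat` and its `$min` parameter are hypothetical, it handles 2-nucleotide motifs only, and it returns the three things asked for in the comments (motif, number of copies, and the whole matched stretch).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): find the first run of a
# 2-nucleotide motif repeated at least $min times, where consecutive copies
# may be separated by at most one stray nucleotide (an assumption).
sub find_imperfect_repeat {
    my ($seq, $min) = @_;
    my $more = $min - 1;    # copies still required after the first one

    # \2 is the motif captured by the inner group; [ACGT]? is the optional
    # single interrupting base in front of each further copy.
    return unless $seq =~ /(([ACGT]{2})(?:[ACGT]?\2){$more,})/;

    my ($run, $motif) = ($1, $2);
    my $count = () = $run =~ /\Q$motif\E/g;   # non-overlapping copies in the run
    return ($motif, $count, $run);
}

my ($motif, $count, $run) =
    find_imperfect_repeat('GTCGTGTGTGTGTAGTGTGTGTGTGTGAACTGA', 10);
print "Motif $motif found $count times in: $run\n" if defined $motif;
```

Because the interrupting base is optional, an uninterrupted run matches too. Comparing `length($run)` with `$count * length($motif)` is one rough way to tell a perfect repeat from an imperfect one.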

Also, see this question for better ways to read an entire file into a string: "What is the best way to slurp a file into a string in Perl?"
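As a rough illustration of the slurping advice, the sketch below reads a whole file in one go by localizing the input record separator `$/`, then strips whitespace as the original script does. The helper name `slurp_sequence` is hypothetical, and the file path is assumed to arrive as the first command-line argument instead of being typed at a prompt.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the contents of
# $path as a single string with all whitespace removed.
sub slurp_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    local $/;                        # undefined separator: read the whole file at once
    my $seq = <$fh>;
    close $fh;
    $seq = '' unless defined $seq;   # guard against an empty read
    $seq =~ s/\s+//g;                # drop newlines and any other whitespace
    return $seq;
}

# Assumption: the sequence file is passed as the first argument.
if (@ARGV) {
    my $seq = slurp_sequence($ARGV[0]);
    printf "Length of sequence: %d\n", length $seq;
}
```

The lexical filehandle and three-argument `open` also work cleanly under `use strict`, unlike the bareword `SEQFILE` handle in the question.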
