8

I am looking for an efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string

Eg: Main String

  1. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  2. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  3. AGACGTACTACTCTACAAGATGCA*TACTCTAC*
  4. AGACGTACTACTTTACAAGATGCA*TACTCTAC*

Search String:

  1. TACTCTACT : this should be considered a match to all of the above main strings.

Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.

I would appreciate if you could give some pointers.

PS: I will have one search string and about 100 million main strings to search the substring for.

Thanks! -Abhi

Alex
  • 5,863
  • 2
  • 29
  • 46
Abhi
  • 6,075
  • 10
  • 41
  • 55
  • Questions. 1) Your numbers in "Main String" - are those separate lines in one "Main String", or four separate "Main String"s (I assume the latter, but want to confirm)? 2) Can you explain why that search string would be considered a match for #3 and #4? The first two clearly match, but it's not clear to me why the last two do, and that sounds like it may be a crucial detail to understand. 3) Can you explain what you mean that part of the substring may be at the end of the main string? – Brian Gerard Jun 06 '11 at 23:27
  • You mention "longest", but your example doesn't provide any clue as to what you mean by longest. Do you mean with the fewest mismatch? – ikegami Jun 06 '11 at 23:48
  • 1
    @Brian Gerard: 1) Those are 4 of the million main strings. 2) Because he tolerates mismatches. #3: A for T, #4, T for C, A for T. 3) "...TACT" should end-of-line-match search string TACTCTACT, because it might be followed by "CTACT". – ikegami Jun 07 '11 at 00:12
  • @Brian : 1.) I showed 4 different types of main strings for example purpose. 2.) I am allowing for n mismatches. In this case n = 1,2 etc. 3.) the search string could be partially contained in the main string and we want to match that partial case also. Thanks for your questions – Abhi Jun 07 '11 at 01:58
  • @ikegami : By longest I mean the literally the "longest" and best possible match ( least mismatches). In way this could be misleading and thanks for asking. So longest might not be the right word. What I meant was if the substring is partially contained at the end of main string I would like to pick that case and not ingore it. I hope this makes it a bit more clearer. – Abhi Jun 07 '11 at 02:00
  • `TACTCTAC*` is a better match than `TACTTTACA` in your fourth example – ikegami Jun 07 '11 at 06:19
  • @ikegami : thanks for your help so far. I have another follow up question. Previously I had one query string compared to millions of search string one at a time for a substring occurence. Now if I want to increase my search space to include multiple query strings, I can either run the same logic again but can I use other data structures which can build something like a suffix tree for query strings and use that for searching substring in the search string ?? Thanks! – Abhi Jun 14 '11 at 23:52

2 Answers2

11

I'm not certain that this is what you're after but BioPerl has an approximate-grep tool called Bio::Grep::Backend::Agrep:

Bio::Grep::Backend::Agrep searches for a query with agrep

And agrep is "approximate grep". AFAIK, this builds a database and then uses that database to make your searches faster so this sounds like what you're after.

Looks like you're working with DNA sequences so you probably have BioPerl already installed.

You could also try using String::Approx directly:

Perl extension for approximate matching (fuzzy matching)

I suspect that Bio::Grep::Backend::Agrep will be faster and a better match for your task though.

mu is too short
  • 426,620
  • 70
  • 833
  • 800
3
use strict;
use warnings;
use feature qw( say );

sub match {
   my ($s, $t, $max_x) = @_;

   my $m = my @s = unpack('(a)*', $s);
   my $n = my @t = unpack('(a)*', $t);

   my @length_at_k     = ( 0 ) x ($m+$n);
   my @mismatches_at_k = ( 0 ) x ($m+$n);
   my $offset = $m;

   my $best_length = 0;
   my @solutions;
   for my $i (0..$m-1) {
      --$offset;

      for my $j (0..$n-1) {
         my $k = $j + $offset;

         if ($s[$i] eq $t[$j]) {
            ++$length_at_k[$k];
         }
         elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
            ++$length_at_k[$k];
            ++$mismatches_at_k[$k];
         }
         else {
            $length_at_k[$k] = 0;
            $mismatches_at_k[$k] = 0;
         }

         my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
         $length = $i+1 if $length > $i+1;

         if ($length >= $best_length) {
            if ($length > $best_length) {
               $best_length = $length;
               @solutions = ();
            }

            push @solutions, $i-$length+1;
         }
      }
   }

   return map { substr($s, $_, $best_length) } @solutions;
}

say for match('AABBCC', 'DDBBEE', 2);

Output:

AABB
ABBC
BBCC
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • you code works absolutely fine. Just wanted to get some idea if the number of sequences that I am checking in a certain amount time sounds feasible or is there a way to scale it up. I am searching for a 36 alphabets substring (with upto 5 mismatches) in 90 million 100 alphabets long search sequences. It is taking me about 13 hours to do this search. Does this sounds like normal based on your experience. I haven't time/memory profiled my code. Thanks! – Abhi Jun 09 '11 at 18:32
  • @Abhi, I'm no expert in algorithms, but I took every shortcut I could think of. You could make it an order or two faster by implementing it in C. – ikegami Jun 09 '11 at 19:19
  • 1
    @Mariya, It's based on a "dynamic programming" solution to the Longest Common Substring problem. See an explanation of what it's based on [here](http://en.wikipedia.org/wiki/Longest_common_substring_problem#Dynamic_programming). – ikegami Dec 09 '11 at 21:41