1

I am looking for an efficient algorithm to allow mismatches (at most 3) when comparing a pattern with a text. Original KMP does this job efficiently on my data but was considering this to extend this algo to accommodate for mismatches.

For my case: GACCCT is considered a match with GGGGGAGGTTTTTT with start position 4 in second sequence

I need to do pairwise comparison between two files. Each contains approximately 500,000 sequences. Sequences in one file is relatively short (~50 bases) while in other is longer (~200)

I tried Regex package in python, Levenshtein algorithm and edit distances. But they are slow and I will have to wait for couple of weeks to get the work done.

user1140126
  • 2,621
  • 7
  • 29
  • 34
  • `strstr` is good for direct matches, perhaps using something like `strstr` for the first half of a pattern and count the mismatches in the second half, then `strstr` to find the second half and count the mismatches in the first half would work? Of course for three mismatches, you'd need a minimum of 1/4 of the string to match with and count the mismatches in the rest... In your example, finding any one of the letters in your pattern would suffice, since you are allowing up to three mismatches. – abiessu Nov 21 '13 at 18:13
  • See also http://stackoverflow.com/questions/2420412/search-for-string-allowing-for-one-mismatch-in-any-location-of-the-string?rq=1 – abiessu Nov 21 '13 at 18:21
  • been there but speed is the deterrent. I need to to this few hundred billion times. The runtime of these are not great as it will take weeks to do this. That is why I am now looking towards KMP to see if can be modified. – user1140126 Nov 21 '13 at 18:26
  • How big is your data (approximately)? 1 KB, 1 MB, 100 MB, 1 GB...? – user541686 Nov 21 '13 at 18:27

2 Answers2

0

I think your data isn't too large, so maybe this will work:
I think you should create a suffix tree for your data. Once you do this, finding substrings will be very easy, whether or not you want to count mismatches: you just traverse the tree with the characters you're looking for, until you've either found a substring, or hit the most number of mismatches you can tolerate.

user541686
  • 205,094
  • 128
  • 528
  • 886
0

If you want at most three mismatches, there's a simple but kind of daft algorithm that'll work on most real cases. Break your pattern into four contiguous parts arbitrarily. (It is probably useful for them to match a random text location with roughly the same probability.) Find all matches in the text of your four contiguous parts. See which of those completes to an at-most-three-mismatches match by brute force.

Mehrdad's solution of using a suffix tree is better in general, but it requires more programming effort.

tmyklebu
  • 13,915
  • 3
  • 28
  • 57