I am looking for an efficient algorithm that allows mismatches (at most 3) when searching for a pattern in a text.
The original KMP algorithm handles exact matching efficiently on my data, and I was considering extending it to tolerate mismatches.
For example, in my case the pattern GACCCT
is considered a match with GGGGGAGGTTTTTT
at start position 4 (0-based) in the second sequence, with exactly 3 mismatches.
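
To make the matching criterion concrete, here is a minimal brute-force sketch in Python of what I mean by a match. It assumes that mismatches are substitutions only (no insertions or deletions), which is how I read my own example, and the function name `find_with_mismatches` is just illustrative. It takes O(n*m) time per pair, so it is only a reference for correctness, not the fast algorithm I am after.

```python
def find_with_mismatches(pattern, text, max_mismatches=3):
    """Return 0-based start positions where `pattern` aligns with `text`
    with at most `max_mismatches` substitutions (Hamming distance).

    Assumption: substitutions only, no insertions or deletions.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, abandon this window early
        else:
            # Inner loop finished without break: window is within the limit.
            hits.append(start)
    return hits


print(find_with_mismatches("GACCCT", "GGGGGAGGTTTTTT"))  # [4]
```

For the example above, the only window with 3 or fewer mismatches starts at index 4 (GAGGTT against GACCCT).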
I need to do a pairwise comparison between two files, each containing approximately 500,000 sequences. The sequences in one file are relatively short (~50 bases), while those in the other are longer (~200 bases).
I tried the Regex package in Python, the Levenshtein algorithm, and edit distances, but they are slow, and I would have to wait a couple of weeks for the job to finish.