Returning sequences with two mismatches from original string

Question

I am looking for some code to return all sequences with two mismatches from the original string, for the purposes of finding parts of a protein sequence that are similar to the original sequence I input. For example, searching for LKLD in LELFLKEF should return: LELF LKEF LFLK I have looked at various python approaches to do this, but I can't seem to make any work.

Search for string allowing for one mismatch in any location of the string

String regex two mismatches Python

Ambiguous substring with mismatches

Does order matter? Would LKLD in DLLKFA return LKFA and DLLK (all the letters in a dif order). — Stucco, Feb 11 '18 at 05:26

avigil · Answer 1 · 2018-02-11T15:42:29.787

A simple approach would be to roll through the sequence and calculate hamming distance for each alignment of the query 'LKLD' to the subject sequence 'LELFLKEF'. There is a sample implementation of hamming distance calculation in the linked wikipedia article. Once you have that your code would do something like:

# hamming distance
d = lambda s1, s2: sum(e1 != e2 for e1, e2 in zip(s1, s2))

subject = 'LELFLKEF'
query = 'LKLD'
for i in range(len(subject)-len(query)+1):
    aligned_subject = subject[i:i+len(query)]
    if d(aligned_subject, query) == 2:
         print(aligned_subject)

Output:

LELF
LFLK
LKEF

Note that this is a bit of a naive solution with plenty of room for optimization, but it will work for simple cases and reasonably small strings. A condensed version that produces a list:

s='LELFLKEF'
q='LKLD'
d= lambda s1, s2: sum(e1 != e2 for e1, e2 in zip(s1, s2))
[s[i:i+len(q)] for i in range(len(s)-len(q)+1) if d(s[i:i+len(q)],q) == 2]

The for loop rolls through all possible ungapped alignments of your two strings:

0
LELFLKEF
||||
LKLD
 1
LELFLKEF
 ||||
 LKLD
  2
LELFLKEF
  ||||
  LKLD
   3
LELFLKEF
   ||||
   LKLD
    4
LELFLKEF
    ||||
    LKLD

There are also many implementations for the problem of alignment of biological sequences so you might also want to explore some more involved techniques that handle things like gapped alignment and more complicated modeling of substitutions

Your code appears to ignore the order of the letters: it gives results like ELFL, which is indeed two letters different from the original sequence, but four letters different if you look at what positions they are. — Mikhail Fomin, Feb 11 '18 at 14:58
i was printing out all comparisons to illustrate, its easy enough to filter by distance == 2 but I will edit — avigil, Feb 11 '18 at 15:07

Returning sequences with two mismatches from original string

1 Answers1