I have been using Longest Common Subsequence (LCS) to find the similarly between sequences. The following dynamic programming code computes the answer.
def lcs(a, b):
lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
# row 0 and column 0 are initialized to 0 already
for i, x in enumerate(a):
for j, y in enumerate(b):
if x == y:
lengths[i+1][j+1] = lengths[i][j] + 1
else:
lengths[i+1][j+1] = \
max(lengths[i+1][j], lengths[i][j+1])
# read the substring out from the matrix
result = ""
x, y = len(a), len(b)
while x != 0 and y != 0:
if lengths[x][y] == lengths[x-1][y]:
x -= 1
elif lengths[x][y] == lengths[x][y-1]:
y -= 1
else:
assert a[x-1] == b[y-1]
result = a[x-1] + result
x -= 1
y -= 1
return result
However I have realised what I really want to solve is a little different. Given a fixed k I need to make sure that the common subsequence only involves substrings of length exactly k. For example, set k = 2 and let the two strings be
A = "abcbabab"
B = "baababcc"
The subsequence that I need would be "ba"+"ab" = baab
.
Is it possible to modify the dynamic programming solution to solve this problem?
The original longest common subsequence problem would just be the k=1 case.
A method that doesn't work.
If we perform the LCS algorithm above, we can get the alignment from the dynamic programming table and check whether those symbols appear in non-overlapping substrings of length k in both input sequences, and delete them if not. The problem is that this doesn't give an optimal solution.