3

I have been using Longest Common Subsequence (LCS) to find the similarly between sequences. The following dynamic programming code computes the answer.

def lcs(a, b):
    lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
    # row 0 and column 0 are initialized to 0 already
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                lengths[i+1][j+1] = lengths[i][j] + 1
            else:
                lengths[i+1][j+1] = \
                    max(lengths[i+1][j], lengths[i][j+1])
    # read the substring out from the matrix
    result = ""
    x, y = len(a), len(b)
    while x != 0 and y != 0:
        if lengths[x][y] == lengths[x-1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y-1]:
            y -= 1
        else:
            assert a[x-1] == b[y-1]
            result = a[x-1] + result
            x -= 1
            y -= 1
    return result

However I have realised what I really want to solve is a little different. Given a fixed k I need to make sure that the common subsequence only involves substrings of length exactly k. For example, set k = 2 and let the two strings be

A = "abcbabab"
B = "baababcc"

The subsequence that I need would be "ba"+"ab" = baab.

Is it possible to modify the dynamic programming solution to solve this problem?

The original longest common subsequence problem would just be the k=1 case.

A method that doesn't work.

If we perform the LCS algorithm above, we can get the alignment from the dynamic programming table and check whether those symbols appear in non-overlapping substrings of length k in both input sequences, and delete them if not. The problem is that this doesn't give an optimal solution.

Simd
  • 19,447
  • 42
  • 136
  • 271
  • What is the scale of `k`? My gut tells me the solution would be exponential in `k`, but I have no proof what so ever for this claim. – amit Mar 17 '14 at 09:54
  • @amit k can be up to n. I feel there should be an O(n^2) solution based on dynamic programming but I can't work it out yet. – Simd Mar 17 '14 at 09:56
  • And for `aaa`, `aaa` with `k=2` longest subsequence should be `aa`, right? (If no, what is the longest sequence in `aaa_bc_bc`, `bc:bc:aaa`?) – amit Mar 17 '14 at 10:08
  • @amit aaa, aaa with k=2 longest subsequence is aa as you say. Yes. – Simd Mar 17 '14 at 10:09

1 Answers1

2

The correction is basically when the two strings in the relevant index have a matching substring (instead of a matching letter, as it is now).

The idea is instead of simply checking for a substring of size 1 in the original solution, check for a substring of length k, and add 1 to the solution (and 'jump' by k in the string).

The formula for the recursive approach that should be translated to the DP solution is:

f(i,0) = 0
f(0,j) = 0
f(i,j) = max{
          f(i-k,j-k) + aux(s1,s2,i,j,k)
          f(i-1,j)
          f(i,j-1)
             }

where aux(s1,s2,i,j,k) is a function that is aimed to check if the two substrings are a match, and is defined as:

aux(s1,s2,i,j,k) = 1         | if s1.substring(i-k,i) equals s2.substring(j-k, j)
                   -infinity | otherwise

You can reconstruct the alignment later using an auxilary matrix that marks the choices of max{}, and go from last to first when the matrix is complete.

Example:

bcbac and cbcba. k=2

Matrix generated by f:

      c   b   c  b   a
   0  0   0   0  0   0
b  0  0   0   0  0   0
c  0  0   0   1  1   1   
b  0  0   1   1  1   1
a  0  0   1   1  1   2
c  0  0   1   1  1   2

And for reproducing the alignment you generate 'choices' matrix:

1 - chose f(i-k,j-k) + aux(s1,s2,i,j,k)
2 - chose f(i-1,j)
3 - chose f(i,j-1)
d - don't care, (all choices are fine)
x/y -means one of x or y.

c      b      c     b      a

b   2/3    2/3    2/3   2/3    2/3
c   2/3    2/3     1     2      2   
b   2/3     3     2/3    d     2/3
a   2/3     3     2/3   2/3     1
c   2/3     3     2/3   2/3     3

Now, reconstructing the alignment - start from the last (bottom right) cell:

  1. It's '3' - move up, don't add anything to the alignment.
  2. It's 1 - we need to add 'ba' to the alignment. (alignment='ba' currently). move k up and left.
  3. It's 1, ad 'bc' to the alignment. current alignment: 'bcba'. move k up and left.
  4. It's 2/3 - move left OR up.

The order of visiting while reconstructing is: (0 means - not visited, number in many cells means any of these is OK).

      c   b   c  b   a
   0  4   0   0  0   0
b  4  3   0   0  0   0
c  0  0   0   0  0   0   
b  0  0   0   2  0   0
a  0  0   0   0  0   0
c  0  0   0   0  0   1
amit
  • 175,853
  • 27
  • 231
  • 333
  • Can this be made O(n^2) time? Each call to aux seems to take linear time in the worst case I think and there are roughly n^2 such calls. – Simd Mar 17 '14 at 10:22
  • @felix This is done on O(n*m*k) (`n,m` are the lengths of the two strings). Checking equality of the relevant substrings is `O(k)`, so each call to `aux()` is `O(k)` (worst case). – amit Mar 17 '14 at 10:23
  • Right. Do you think it can be done in O(nm) time instead? Say k = n/1000. – Simd Mar 17 '14 at 10:34
  • I suppose aux can be precomputed as a table in O(nm) time, right? – Simd Mar 17 '14 at 10:46
  • How could you extract an actual alignment? Can you do the usual predecessor trick? – Simd Mar 17 '14 at 10:50
  • I need to think about reducing the `k` factor, I think it is possible - not sure how. Finding the actual alignment can be done using an extra table that marks at each point what decision did `max()` make, and when the two tables are complete, retrace it backward from end to start. An alternative is using the original table, as described similarly in [this thread](http://stackoverflow.com/q/7489398/572670) (for knapsack, but the principle remains). – amit Mar 17 '14 at 10:55
  • About using this extra table. Consider bcbac and cbcba and set k = 2. If we look at the first three symbols of each (bcb and cbc) we see two options for a LCS of length 2. Either bc or cb. The problem is that bc can be extended to bcba but cb can't be. So it doesn't seem so easy to know which to choose. – Simd Mar 17 '14 at 11:02
  • Sure it is, think what the algorithm will do. It will set d[2][3]=1 (cb), d[3][2]=1 (bc). Now, when filling d[3][3] - one of d[2][3] will be chosen arbitrarly, but one of them WILL be chosen, or (2,3) (X)OR (3,2). Set the symbol for the one that will be chosen, and it will get you to a correct maximal alignment. – amit Mar 17 '14 at 11:06
  • I don't quite understand. Say we choose (2,3) which corresponds to using cb. When we then go on to extend to bcbac, cbcba how do we ever get the LCS bcba? Didn't we need to choose (3,2) instead as our predecessor? – Simd Mar 17 '14 at 11:27
  • @felix sorry, took wrong strings. look at the edit - I tried to clarify this issue. – amit Mar 17 '14 at 12:27