7

What are good ways to find all the common subsequences of length k of two strings?

Example:

s1= AAGACC

s2= AGATAACCAGGAGCTGC

all common subsequences of length 5: AAGAC AAACC AGACC AAGCC

Camilo Celis Guzman
  • 595
  • 1
  • 5
  • 17
  • 1
    OP, are you familiar with *dynamic programming*? You should find it in any good algorithms book. – Colonel Panic Jun 17 '15 at 11:07
  • Are common subsequences that are equal as strings but different as sequences of source positions considered equal? E.g., in your example, there are 3*15=45 ways of producing the common subsequence `AA`, so should `AA` be output 45 times, or just once? – j_random_hacker Jun 17 '15 at 14:10
  • @j_random_hacker just once. – Camilo Celis Guzman Jun 17 '15 at 15:47
  • In that case, I believe the only way to avoid an O(|A|^k)-space solution (which you would need to record whether each length-k subsequence has been "seen" yet -- inputs like `(AB)`^(2k) seem to force this) is to try to generate length-k subsequences for each string in some (e.g., lexicographical) order, and list-merge them. – j_random_hacker Jun 17 '15 at 16:15
  • In the preceding comment, I meant that |A| is the size of the alphabet. – j_random_hacker Jun 17 '15 at 16:22

3 Answers3

4

One relatively straightforward way would be to reconstruct the sequences from the LCS matrix. Here is an O(n^2 * k + x * n) algorithm to do so, where x is the size of the output (i.e. number of common subsequences of length k). It's in C++, but it should be rather easy to translate to C:

const int N = 100;
int lcs[N][N];
set<tuple<string,int,int,int>> vis;

string s1 = "AAGACC";
string s2 = "AGATAACCAGGAGCTGC";

void reconstruct(const string& res, int i, int j, int k) {
    tuple<string,int,int,int> st(res, i, j, k);
    if (vis.count(st))
        return;
    vis.insert(st);
    if (lcs[i][j] < k) return;
    if (i == 0  && j == 0 && k == 0) {
        cout << res << endl;
        return;
    }
    if (i > 0)
        reconstruct(res, i-1, j, k);
    if (j > 0)
        reconstruct(res, i, j-1, k);
    if (i>0 && j>0 && s1[i-1] == s2[j-1])
        reconstruct(string(1,s1[i-1]) + res, i-1, j-1, k-1);
}

int main() {
    lcs[0][0] = 0;
    for (int i = 0; i <= s1.size(); ++i)
        lcs[i][0] = 0;
    for (int j = 0; j <= s1.size(); ++j)
        lcs[0][j] = 0;
    for (int i = 0; i <= s1.size(); ++i) {
        for (int j = 0; j <= s2.size(); ++j) {
            if (i > 0)
                lcs[i][j] = max(lcs[i][j], lcs[i-1][j]);
            if (j > 0)
                lcs[i][j] = max(lcs[i][j], lcs[i][j-1]);
            if (i > 0 && j > 0 && s1[i-1] == s2[j-1])
                lcs[i][j] = max(lcs[i][j], lcs[i-1][j-1] + 1);
        }
    }
    reconstruct("", s1.size(), s2.size(), 5);
}

There should also be an O(n * (k + x)) way to solve this, based on a slightly different DP approach: Let f(i, k) be the minimum index j such that lcs(i, j) >= k. We have the recurrence

f(i, 0) = 0 for all i
f(i, k) = min{f(i-1, k), 
              minimum j > f(i-1, k-1) such that s2[j] = s1[i]}

We can also reconstruct the sequences of length k from the matrix f.

Niklas B.
  • 92,950
  • 18
  • 194
  • 224
  • I tried it with s1= `ACACTTAGTGGGAGTCTCA` and s2 = `TACGC`, and `k`=`3`, one of the outputs is `GAC`; which is clearly **not** a common subsequence between s1 and s2. – Camilo Celis Guzman Jun 17 '15 at 11:50
  • @CamiloCelisGuzman Interesting. Well in that case I will temporarily roll back to my initial answer, which should be correct – Niklas B. Jun 17 '15 at 13:31
  • Does this generate each string equal to a common subsequence only once? The OP has (belatedly) indicated this as a requirement. If not, I think adapting this will need O(|A|^k) space in the worst case to store the status of whether or not a given string has been seen already (with |A| the alphabet size). – j_random_hacker Jun 17 '15 at 16:23
  • @j_random_hacker No it doesn't. I thought I had another approach based on the second recurrence that is output sensitive, but I think it might not be fixable. I think the problem is not the space consumption, but that asymptotically more sequences have to be generated, only to be filtered out after – Niklas B. Jun 17 '15 at 16:35
  • I think the only way to avoid the space explosion is to generate all length-k subsequences of each string separately in (e.g., lex) order, and intersect them with a list merge. The 2 individual ordered lists can each be generated in O(|A|^k) time by repeated application of the find-next-largest algorithm to string characters. Also there are probably shortcuts (that probably don't reduce the asymptotic complexity) like if there is no CS with prefix X and |X| = i < k then we don't have to generate any of the length-(>i) CSes that begin with X. – j_random_hacker Jun 17 '15 at 16:42
  • And as I understand it, if we generate sequences and filter them out "later", then in the meantime we need to store them in... space ;-) – j_random_hacker Jun 17 '15 at 16:43
  • @j_random_hacker I wouldn't have much problem with the space because time >= space anyways. The problem is the time complexity, which can be much worse if you try to generate all subsequences. An optimal algorithm would have space <= time = O(s + poly(n,k)) where s is the sum of the length of distinct common subsequences – Niklas B. Jun 17 '15 at 16:44
  • Sure, time >= space, but there are worst cases that will produce O(|A|^O(k)) time anyway: when A = B = `XY`^(2k) (2k copies of `XY`), for example. Then by picking either `X` or `Y` from each string at each of the first k adjacent pairs we get all 2^k possible length-k strings on the alphabet `{X, Y}`, and the remaining k adjacent pairs will produce them all a second time (i.e. need to be filtered out). – j_random_hacker Jun 17 '15 at 16:48
  • Yes of course, but output sensitivity is the best you can hope for. I'm not arguing that my algorithm is good, it certainly is not. I'm just formulating the goal of what would be the best possible outcome (i.e. time complexity), when you take the output length as a variable. There is a clear lower bound of Omega(s) for the problem, so O(poly(n,k) + s) would be pretty good. A solution that works by filtering out duplicates can never achieve that, unless you generate exactly O(s) "candidate output", which seems highly unlikely – Niklas B. Jun 17 '15 at 16:51
  • Please take a look at the code in my answer, I think it uses a minimum of both processing time and memory space. – Brent Washburne Jun 18 '15 at 16:07
1

Create a trie of all the subsequences of given length k from s1 and then go over s2 and check for each sequence of length k if it is in the trie.

shapiro yaacov
  • 2,308
  • 2
  • 26
  • 39
  • As @NikklasB. pointed out, problem is there are `O(Choose(n,k))` subsequences (its not substrings), which is in `O(min{n^k,n^(n-k)})` (So, a lot) – amit Jun 17 '15 at 09:33
  • True. Is there a way to achieve this using the LCS (longest common subsequence) memoization matrix? – Camilo Celis Guzman Jun 17 '15 at 09:35
  • @amit - I agree, but in any case you need to check all the options, unless I am missing something. You could do some sophisticated nested looping instead - I'll think about it. – shapiro yaacov Jun 17 '15 at 10:04
  • @shapiro.yaacov No, you need to check only the *common* subsequences, of which there could be asymptotically fewer depending on the input – Niklas B. Jun 17 '15 at 10:05
  • @shapiro.yaacov Not sure, Longest Common Subsequence doesn't require it. I am wondering if you can get only the count more efficiently, which will also make getting the subsequences more efficiently (while still linear in the number of them, which might be exponential in worst cases, but much better often). – amit Jun 17 '15 at 10:06
  • @amit - I didn't suggest LCS – shapiro yaacov Jun 17 '15 at 10:08
  • @shapiro.yaacov I know, I am suggesting there might be a modification for it to do it more efficiently. Creating a exponential number of tries, and chiecking each of them going to take O(n^k * (m+k)) time, doesn't strike me as a good idea. – amit Jun 17 '15 at 10:10
1

Here's a version of the algorithm that uses recursion, a stack of size k and includes two optimizations to skip over characters that have already been seen and to skip sub-subsequences that do not exist. The strings are not unique (there can be duplicates), so run the output through uniq.

#include <stdio.h>
#include <string.h>

/* s1 is the full first string, s2 is the suffix of the second string
 * (starting after the subsequence at depth r),
 * pos is the positions of chars in s1, r is the recursion depth,
 * and k is the length of subsequences that we are trying to match
 */
void recur(char *s1, char *s2, size_t pos[], size_t r, size_t k)
{
    char seen[256] = {0};       /* have we seen a character in s1 before? */
    size_t p0 = (r == 0) ? 0 : pos[r-1] + 1;    /* start at 0 or pos[r-1]+1 */
    size_t p1 = strlen(s1) - k + r;             /* stop at end of string - k + r */
    size_t p;

    if (r == k)         /* we made it, print the matching string */
    {
        for (p=0; p<k; p++)
            putchar(s1[pos[p]]);
        putchar('\n');
        return;
    }

    for (p=p0; p<=p1; p++)      /* at this pos[], loop through chars to end of string */
    {
        char c = s1[p];         /* get the next char in s1 */
        if (seen[c])
            continue;           /* don't go any further if we've seen this char already */
        seen[c] = 1;

        pos[r] = p;
        char *s = strchr(s2, c);        /* is this char in s2 */
        if (s != NULL)
            recur(s1, s+1, pos, r+1, k);        /* recursively proceed with next char */
    }
}


int main()
{
    char *s1 = "AAGACC";
    char *s2 = "AGATAACCAGGAGCTGC";
    size_t k = 5;
    size_t pos[k];

    if (strlen(s1) < k || strlen(s2) < k)       /* make sure we have at least k chars in each string */
        return 1;       /* exit with error */

    recur(s1, s2, pos, 0, k);
    return 0;
}

The output is:

AAGAC
AAGCC
AAACC
AGACC
Brent Washburne
  • 12,904
  • 4
  • 60
  • 82