13

Problem: I need the length of the LCS (longest common subsequence) between two strings. The strings are at most 100 characters long. The alphabet is the usual DNA one, the 4 characters "ACGT". The dynamic programming approach is not quick enough.

My problem is that I am dealing with lots and lots of pairs (on the order of hundreds of millions, as far as I can see). I believe I have reduced the number of calls to the LCS_Length function to the minimum possible, so the only other way to make my program run faster is to have a more efficient LCS_Length function.

I started off by implementing the usual dynamic programming approach. It gives the correct answer and is hopefully implemented properly.

#include <string.h>

#define arrayLengthMacro(a) (strlen(a) + 1)
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * The two strings are compared by filling a dynamic programming
 * LCS table. Since we only require the length of the LCS,
 * we can get it rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

    maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
            /* If a match is found, we assign the value at (i-1, j-1),
             * incremented by 1, to the (i, j) position.
             */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

That should be O(mn) (hopefully).

Then, in search of speed, I found this post: Longest Common Subsequence, which pointed to the O(ND) paper by Myers. I tried that approach, which relates the LCS to the shortest edit script (SES). The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings, and L is the LCS length. But this isn't always correct in my implementation. Here is my implementation (which I think is correct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

There are examples in main. E.g. "tomato" and "potato" (don't comment) have an LCS length of 4. The implementation finds that it is 5, since the SES (D in the code) comes out as 2 instead of the 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.

Any approach that is implementable (I don't have immense programming skills) is welcome. (If someone knows how to take advantage of the small alphabet, for example.)

EDIT: I think it would be useful to say, on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only one string, say s1, compared with a few million others. It might be s200 with s1000, then s0 with s10000, and then s250 with s100000. I am not likely to be able to track the most used strings either. I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.

EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the LCS function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run, and the max is 1 million, if that is feasible.)

Yiannis
  • In the second version, you are not deleting `V_`, which will certainly cause problems if you call this function hundreds of millions of times. You should in any case avoid allocating memory every time the function is called, as it will slow you down. In your case I would just declare a fixed-length array `int V_[2*MAX_STRING+2]` on the stack. – TonyK Jul 02 '11 at 08:43
  • Space used by LCS is O(mn). You can bring it down to O(n), which can make your program much faster because the cache hit ratio will increase. However, I generally use this when n,m ~ 10^3. You can try [this](http://www.ideone.com/4rG7k) code which I wrote some time ago and see the difference, if any, in time. – Priyank Bhatnagar Jul 02 '11 at 09:08
  • @logic_max: that is a nice modification of the algorithm, but the copy on line 19 will undo any of the performance advantages of avoiding cache misses (which I imagine aren't too terrible for a ~40kb array anyway). You could avoid that copy by keeping two int pointers, L and L_old, that you swap at each iteration. – Julian Panetta Jul 02 '11 at 09:30
  • @Julian Panetta : That is the reason I didn't post my reduced space implementation. Because of that copying that was messing up things. Your approach is very nice. – Yiannis Jul 02 '11 at 09:35
  • @Julian : Yes, that would be really good :). For n,m~10^3, it gave me about 1 sec advantage, but now it could be better, thanks :). – Priyank Bhatnagar Jul 02 '11 at 10:35
  • look at LCSS algo in this paper: http://www.eecs.umich.edu/db/files/sigmod07timeseries.pdf If you want a Java implementation let me know b/c I have it – Adrian Jul 30 '14 at 21:25
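
To make the space-reduction suggestions in the comments concrete, here is a rough sketch of a two-row variant: only the previous and current rows of the DP table are kept, in fixed stack arrays (no malloc per call), and two pointers are swapped instead of copying a row. The name LCS_Length_TwoRows and the details are illustrative, not code from the thread; it assumes strings of at most MAX_STRING - 1 characters.

#include <string.h>

#define MAX_STRING 101

int LCS_Length_TwoRows(const char *a, const char *b)
{
    int lengthA = (int)strlen(a), lengthB = (int)strlen(b);
    int row1[MAX_STRING], row2[MAX_STRING];
    int *prev = row1, *curr = row2, *tmp;
    int i, j;

    /* row for the empty prefix of a */
    for (j = 0; j <= lengthB; j++)
        prev[j] = 0;

    for (i = 1; i <= lengthA; i++)
    {
        curr[0] = 0;
        for (j = 1; j <= lengthB; j++)
        {
            if (a[i - 1] == b[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = (prev[j] > curr[j - 1]) ? prev[j] : curr[j - 1];
        }
        /* swap the row pointers instead of copying the row */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    /* after the final swap, prev points at the last filled row */
    return prev[lengthB];
}

With 100-character strings the two rows occupy roughly 800 bytes, so they should stay in L1 cache for the whole fill, which is the point of the space reduction mentioned above.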

3 Answers

2

I'm not familiar with the fancier-than-dynamic-programming algorithms for computing LCS, but I wanted to point out a few things:

First, the O(ND) approach only makes sense if you're comparing very large, very similar strings. This doesn't seem to be the case for you.

Second, speeding up the asymptotic performance of your LCS_Length function is probably not what you should be focusing on, since your strings are pretty short. If you only care about finding similar or dissimilar pairs (and not all pairs' exact LCS), then the BK-tree mentioned by Yannick looks like a promising way to go.

Finally, some things bothered me about your DP implementation. The correct interpretation of "board[i][j]" in your code is "the length of the LCS of a[1..i] and b[1..j]" (I'm using 1-indexing in this notation). Therefore, your for loops should include i = lengthA and j = lengthB. It looks like you hacked around this bug in your code by introducing arrayLengthMacro(a), but that hack doesn't make sense in the context of the algorithm. Once "board" is filled, you can look up the solution in board[lengthA][lengthB], which means you can get rid of the unnecessary "if (LCS < board[i][j])" block and simply return board[lengthA][lengthB]. Also, the loop bounds look wrong in the initialization (I'm pretty sure it should be for (i = 0; i <= maxLength; i++) where maxLength = max(lengthA, lengthB)).
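
For concreteness, a minimal sketch of the fill with those fixes applied might look like the following (illustrative only, with a made-up name LCS_Length_Fixed: plain strlen lengths, loops running through i = lengthA and j = lengthB, and the answer read from board[lengthA][lengthB] rather than counted along the way):

#include <string.h>

#define MAX_STRING 101

int LCS_Length_Fixed(const char *a, const char *b)
{
    int lengthA = (int)strlen(a), lengthB = (int)strlen(b);
    int board[MAX_STRING][MAX_STRING];
    int i, j;

    /* board[i][j] = LCS length of a[1..i] and b[1..j]; row and column 0 are 0 */
    for (i = 0; i <= lengthA; i++)
        board[i][0] = 0;
    for (j = 0; j <= lengthB; j++)
        board[0][j] = 0;

    for (i = 1; i <= lengthA; i++)
    {
        for (j = 1; j <= lengthB; j++)
        {
            if (a[i - 1] == b[j - 1])
                board[i][j] = board[i - 1][j - 1] + 1;
            else if (board[i - 1][j] > board[i][j - 1])
                board[i][j] = board[i - 1][j];
            else
                board[i][j] = board[i][j - 1];
        }
    }

    return board[lengthA][lengthB];
}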

Julian Panetta
  • @ Julian Panetta : thanks for all your points. Indeed the O(ND) would not be a very good choice (it was more like a trial I guess). On the second point, I need the exact LCS length. And on the third point, thanks about the things you noted on the code. I have some way to go till I get the experience and ease with programming logic. – Yiannis Jul 02 '11 at 10:09
  • Oh, okay, since you need all pairs' LCS lengths, I think you should try Yannick's suggestion of reusing the DP table when comparing one string, A, to all other strings, B. A good way to structure this is to build a trie dictionary and run a DFS for each A. Then, each time you descend down the trie (read a letter from B), you fill out a row of the table. Each time you backtrack you decrement your row index. Every time you hit a word node in the trie, you've computed an LCS for (A, B). This is equivalent to sorting the strings, but with cleaner code. NOTE: I'm now using the B index for rows. – Julian Panetta Jul 02 '11 at 20:00
  • I am kind of hesitant in storing things, because the whole process is space consuming already. When I have 1 million strings for example, what chunk of memory would that trie take? – Yiannis Jul 03 '11 at 04:41
  • ...One of the uses I have of the LCS is to use as a metric to run Prim's algorithm (so each string is a node in a potentially vast complete graph with vertex weights given by the LCS length). I have done that with Fib heap which runs reasonably well. Reducing the number of calls to the LCS length is more helpful since as you have well spotted, with small strings (25 to 100) it might not be very important to improve the efficiency of the actual function. Does reusing the DP table with a trie help under this scope? (I am using the LCS distance elsewhere as well and again very frequently). – Yiannis Jul 03 '11 at 05:00
  • The space the trie occupies is very closely linked to how much of a speedup reusing the DP table gives you. In the (impossibly bad) worst case where no strings share prefixes, there will be a node per character in your dictionary. Each node needs to store 4 pointers/indices (for outbound A, C, T, and G edges) plus a flag specifying whether it is a word or not. So this could mean as much as (10^6*100 nodes) * ~16 bytes/node ~= 1.6gb. – Julian Panetta Jul 06 '11 at 03:51
  • But you would hope many of the strings have long prefixes in common, which will vastly reduce the size of the tree. Following the DFS-based algorithm I described, the complexity of computing LCS(A, B) for all B is O(nodes in trie * length(A)). So you can see how the trie footprint and the speedup are related: the time saved by reusing the table is O((total characters in dictionary - nodes in trie) * length(A)). – Julian Panetta Jul 06 '11 at 03:59
  • I've been thinking about how you can use a BK tree to find nearest neighbors of a query (for Prim's algorithm) instead of all neighbors of some distance away. I'm not convinced a BK tree will give you a worthwhile reduction of comparisons, especially when you factor in the overhead of construction and maintenance (you'll need to remove words from the tree at each iteration of Prim's algorithm--you need to do this too for the trie approach, but that's trivial). There are a few other metric trees on Wikipedia (cover trees looked promising), but it's not immediately obvious how helpful they are. – Julian Panetta Jul 06 '11 at 04:12
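
To make the trie idea above concrete, here is a rough sketch of the node layout and the DFS that fills one DP row per trie edge, as described in the comments. Everything here (TrieNode, lcs_against_all, the report callback) is made up for illustration, trie construction is omitted, and strings are assumed to be at most 100 characters over the ACGT alphabet.

#include <string.h>

#define MAX_STRING 101
#define ALPHA 4   /* children 0..3 correspond to 'A', 'C', 'G', 'T' */

typedef struct TrieNode {
    struct TrieNode *child[ALPHA]; /* outbound A, C, G, T edges */
    int is_word;                   /* non-zero if a dictionary string B ends here */
} TrieNode;

/* rows[r][i] = LCS length of the r characters on the path from the root
 * (a prefix of some dictionary string B) and the first i characters of A.
 * Descending one trie edge fills one new row; backtracking simply reuses it. */
static void dfs(const TrieNode *node, const char *a, int lengthA,
                int rows[][MAX_STRING], int r,
                void (*report)(const TrieNode *node, int lcs))
{
    int c, i;

    if (node->is_word)
        report(node, rows[r][lengthA]); /* LCS(A, B) for the B ending here */

    for (c = 0; c < ALPHA; c++)
    {
        const char bc = "ACGT"[c];
        if (node->child[c] == NULL)
            continue;
        rows[r + 1][0] = 0;
        for (i = 1; i <= lengthA; i++)
        {
            if (a[i - 1] == bc)
                rows[r + 1][i] = rows[r][i - 1] + 1;
            else if (rows[r][i] > rows[r + 1][i - 1])
                rows[r + 1][i] = rows[r][i];
            else
                rows[r + 1][i] = rows[r + 1][i - 1];
        }
        dfs(node->child[c], a, lengthA, rows, r + 1, report);
    }
}

/* Compare one string A against every dictionary string stored in the trie. */
void lcs_against_all(const TrieNode *root, const char *a,
                     void (*report)(const TrieNode *node, int lcs))
{
    static int rows[MAX_STRING][MAX_STRING]; /* row 0: empty prefix of B */
    memset(rows[0], 0, sizeof rows[0]);
    dfs(root, a, (int)strlen(a), rows, 0, report);
}
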
1

There are several ways to make your computation faster:

  • Instead of plain dynamic programming, you can use A* search (using the heuristic that a partial alignment up to (i,j) must necessarily have at least |i-j| deletions or insertions in it).
  • If you're comparing one sequence with a whole bunch of other ones, you can save time by computing the dynamic programming table (or the A* search state) for a shared prefix once and re-using that part of the computation. If you stick with the dynamic programming table, you could sort the library of strings in lexicographic order and then only throw away the part that changes (e.g. if you have 'banana' and want to compare it with 'panama' and 'panamericanism', you can reuse the part of the table that covers 'panam').
  • If most of the strings are quite similar, you can save time by looking for a common prefix and excluding it from the computation (e.g. the LCS of "panama" and "panamericanism" is the common prefix "panam" plus the LCS of "a" and "ericanism"); there is a sketch of this after the list.
  • If the strings are quite dissimilar, you can use the symbol counts to get a lower bound on the number of edits (e.g., "AAAB" to "ABBB" needs at least 2 edits because there are 3 As in one and only 1 A in the other). This can also be used in the heuristic for an A* search.
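
The prefix trick in the third bullet (and the symmetric trick for a common suffix) is cheap to bolt on in front of any exact LCS routine: the shared characters contribute their full length to the LCS, so the result stays exact and only the differing middles hit the O(mn) table. A rough sketch, where lcs_trimmed is a hypothetical wrapper and lcs_core stands in for an exact DP routine that takes explicit lengths:

#include <string.h>

/* Exact LCS of a[0..lengthA-1] and b[0..lengthB-1]; stand-in for the
 * questioner's DP routine, adapted to take explicit lengths because the
 * trimmed middles are not NUL-terminated. */
int lcs_core(const char *a, int lengthA, const char *b, int lengthB);

int lcs_trimmed(const char *a, const char *b)
{
    int lengthA = (int)strlen(a), lengthB = (int)strlen(b);
    int prefix = 0, suffix = 0;

    /* strip the common prefix */
    while (prefix < lengthA && prefix < lengthB && a[prefix] == b[prefix])
        prefix++;

    /* strip the common suffix of what remains (never crossing the prefix) */
    while (suffix < lengthA - prefix && suffix < lengthB - prefix &&
           a[lengthA - 1 - suffix] == b[lengthB - 1 - suffix])
        suffix++;

    return prefix + suffix +
           lcs_core(a + prefix, lengthA - prefix - suffix,
                    b + prefix, lengthB - prefix - suffix);
}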

EDIT: for the comparing-to-the-same-set-of-strings case, one person suggested a BK-tree data structure in Efficient way of calculating likeness scores of strings when sample size is large?

Yannick Versley
1

I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences, which is all about string operations for computational biology.

borrible