24

I am trying to find the longest common subsequence of 3 or more strings. The Wikipedia article has a great description of how to do this for 2 strings, but I'm a little unsure of how to extend this to 3 or more strings.

There are plenty of libraries for finding the LCS of 2 strings, so I'd like to use one of them if possible. If I have 3 strings A, B and C, is it valid to find the LCS of A and B as X, and then find the LCS of X and C, or is this the wrong way to do it?

I've implemented it in Python as follows:

import difflib

def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)

print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])

This outputs "ba", however it should be "baa".

del
  • 6,341
  • 10
  • 42
  • 45
  • 2
    It's "baa". He wants longest common **subsequence**, not longest common substring. – ypercubeᵀᴹ Feb 20 '11 at 13:27
  • 3
    The iterative approach you mention will not work. For example it will give wrong results for `abc12`, `12abc`, `12xyz`. – sth Feb 20 '11 at 13:34
  • 2
    Please note that there may be more than 1 lcs for any two strings. For example, for your first two strings `('abacbdab', 'bdcaba')`, there are at least five lcs: `"baa", "bab", "bca", "bcb", "dab"`. Does your code take that into account? – ypercubeᵀᴹ Feb 20 '11 at 13:35
  • 2
    Oops, finding lcs by hand is not so trivial! There are three, `"bcab", "bdab", "bcba"` in the above example, which explains why your code can't find `"baa"`. But the point is still valid, there can be **more than one** lcs for 2 strings. – ypercubeᵀᴹ Feb 20 '11 at 13:40
  • @ypercube - Good point, my code doesn't take that in to account since difflib only gives you one LCS. – del Feb 20 '11 at 13:43
  • This is a very well known problem. Just google it. see [http://www.springerlink.com/content/fu4t4442l7577712/](http://www.springerlink.com/content/fu4t4442l7577712/) where two algorithms for this problem are presented. Also look at [http://www.perlmonks.org/?node_id=548827](http://www.perlmonks.org/?node_id=548827) (many other links are available at google) – Luixv Feb 20 '11 at 13:24

5 Answers5

33

Just generalize the recurrence relation.

For three strings:

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

Should be easy to generalize to more strings from this.

IVlad
  • 43,099
  • 13
  • 111
  • 179
  • 1
    I adapted the LCS code from here as per your suggestion: http://code.activestate.com/recipes/576869-longest-common-subsequence-problem-solver/. It works but is extremely slow and I run into maximum recursion depth errors on longer sequences. I was hoping there is a library that implements this already. – del Feb 20 '11 at 14:56
  • 2
    @del - it's not going to get much faster, because the problem has no better exact solution. – IVlad Feb 20 '11 at 16:24
  • @del, it can be programmed iteratively instead of recursively. [The pseudocode on Wikipedia](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Code_for_the_dynamic_programming_solution) even gives the iterative solution. Just extend this to 3 dimensions, or k dimensions if you feel like it. The dynamic programming algorithm is O(Π_{i=1,k} n_k) in time where n_i is the length of the ith of k input strings (O(n^k) if all strings same length). So, yes, it's slow. Space complexity can be O([Π_{i=1,k} n_k]/[argmax_{i=1,k} n_k]) (if all strings same length, O(n^{k-1})). – rlibby Feb 20 '11 at 21:47
  • I have posted a solution which uses dynamic programming to save immediate results for a faster solution and less memory usage. – Kent Munthe Caspersen Nov 27 '13 at 19:30
  • @KentMuntheCaspersen - this uses dynamic programming as well (or CAN use it, if you implement the recurrence right). Yours seems to use backtracking? It's hard to tell from your code what you're doing exactly, but I doubt it's faster. – IVlad Nov 27 '13 at 21:05
  • Miy solution uses recursion and works for n strings. It uses a n-dimensional array, where the entry [i, j, k, ...] stores the solution to the problem of finding the LCS between string1(0,i), string2(0,j), string3(0,k) ... – Kent Munthe Caspersen Nov 27 '13 at 21:18
  • I wrote it fast so it is not very clean, but I decided to post it anyways since I could not find an implementation anywhere else. Its complexity is O(n^m), where n is maximum string length and m is number of strings. Your solution is simple and powerful, it perfectly demonstrates the technique. – Kent Munthe Caspersen Nov 27 '13 at 21:27
  • Someone directly copy-pasted your answer [here](http://stackoverflow.com/a/7951566/3375713), without even mentioning your name ;-) – Sнаđошƒаӽ Jan 09 '17 at 16:55
  • Aww, I am flattered :). – IVlad Jan 09 '17 at 16:59
  • Is there any performant solution that uses O(n^2) _memory_ ? This one is painful in memory usage as O(n^k) if there are k strings – WestCoastProjects Feb 08 '21 at 04:14
9

I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.

The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.

So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.

Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.

def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]

    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], 
                                              subs[i][j+1][k+1], 
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1

    return lcs[::-1]
BarthesSimpson
  • 926
  • 8
  • 21
4

To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally like shown in the Link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that given value has already been solved. That is why you traverse the 2-dimensional array diagonally.

This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate an array of N dimensions such that any element is reached only when all sub-problems the element requires a solution to has been solved.

Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:

string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}

My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.

On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"

The LCS returned is "54442"

Kent Munthe Caspersen
  • 5,918
  • 1
  • 35
  • 34
4

This below code can find the longest common subsequence in N strings. This uses itertools to generate required index combinations and then use these indexes for finding common substring.

Example Execution:
Input:
Enter the number of sequences: 3
Enter sequence 1 : 83217
Enter sequence 2 : 8213897
Enter sequence 3 : 683147

Output:
837

from itertools import product
import numpy as np
import pdb

def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))

def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs);
    lengths = np.array([len(sequence) for sequence in sqs]);
    incrLengths = lengths + 1;
    lengths = tuple(lengths);
    inverseDistances = np.zeros(incrLengths);
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]];
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1];
        neighborIndexes = list(neighbors(tupleIndex));
        operationsWithMisMatch = np.array([]);
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex]);
        operationsWithMatch = np.copy(operationsWithMisMatch);
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1;
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)];
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch);
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch);
        # pdb.set_trace();

    subString = "";
    mainTupleIndex = lengths;
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex));
        anyOperation = False;
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex];
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex;
                anyOperation = True;
                break;
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1]);
            mainTupleIndex = neighborsIndexes[-1];
    return subString[::-1];

numberOfSequences = int(input("Enter the number of sequences: "));
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)];
print(longestCommonSubSequenceOfN(sequences));
2

Here is a link to the solution view explanation here output is Length of LCS is 2

 # Python program to find 
 # LCS of three strings 

 # Returns length of LCS 
 # for X[0..m-1], Y[0..n-1] 
 # and Z[0..o-1] 
def lcsOf3(X, Y, Z, m, n, o): 

    L = [[[0 for i in range(o+1)] for j in range(n+1)] 
        for k in range(m+1)] 

    ''' Following steps build L[m+1][n+1][o+1] in 
    bottom up fashion. Note that L[i][j][k] 
    contains length of LCS of X[0..i-1] and 
    Y[0..j-1] and Z[0.....k-1] '''
   for i in range(m+1): 
    for j in range(n+1): 
        for k in range(o+1): 
            if (i == 0 or j == 0 or k == 0): 
                L[i][j][k] = 0

            elif (X[i-1] == Y[j-1] and
                  X[i-1] == Z[k-1]): 
                L[i][j][k] = L[i-1][j-1][k-1] + 1

            else: 
                L[i][j][k] = max(max(L[i-1][j][k], 
                L[i][j-1][k]), 
                                L[i][j][k-1]) 

      # L[m][n][o] contains length of LCS for 
      # X[0..n-1] and Y[0..m-1] and Z[0..o-1] 
    return L[m][n][o] 

  # Driver program to test above function 

X = 'AGGT12'
Y = '12TXAYB'
Z = '12XBA'

m = len(X) 
n = len(Y) 
o = len(Z) 

print('Length of LCS is', lcsOf3(X, Y, Z, m, n, o)) 

# This code is contributed by Soumen Ghosh.      
Nsikan Sylvester
  • 583
  • 5
  • 14