21

I searched online for a C++ Longest Common Substring implementation but failed to find a decent one. I need an LCS algorithm that returns the substring itself, not just its length.

I was wondering, though, about how I can do this between multiple strings.

My idea was to find the longest common substring of two of the strings and then check it against all the others, but this is very slow and requires keeping many long strings in memory.

Any idea how this can be sped up for multiple strings? Thank you.

Important edit: One of the variables I'm given determines the number of strings the longest common substring needs to appear in. So I can be given 10 strings and asked for the LCS of all of them (K=10), or for the LCS of 4 of them (K=4). In the latter case I'm not told which 4, so I have to find the best 4.

chema989
David Gomes
  • 2
    If you need to do this with multiple strings, then you should not follow your approach. Consider that the overall LCS might not be a substring of the LCS of two particular strings [e.g. "123asdfg", "asdfg123", "123"; if you run LCS on the first two you will get "asdfg", which has no characters in common with the last string]. As for returning the actual LCS substring, the common algorithm ends with a table that you can walk to create such a string in linear time (in the size of the LCS) – David Rodríguez - dribeas Apr 20 '12 at 15:07
  • http://www.markusstengel.de/text/en/i_4_1_5_3.html – Nick Apr 20 '12 at 15:36
  • Check here for [Analysis of Longest common substring matching](http://www.msccomputerscience.com/2014/10/analysis-of-longest-common-substring_18.html) – ARJUN Oct 21 '14 at 06:06

7 Answers

14

Here is an excellent article on finding all common substrings efficiently, with examples in C. This may be overkill if you need just the longest, but it may be easier to understand than the general articles about suffix trees.

Adrian McCarthy
  • Very simple article on suffix arrays, suitable for this problem. Got me up and running quickly. Very nice. – chowey Mar 26 '15 at 18:29
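For readers without access to the linked article, the suffix-array idea for two strings can be sketched roughly as follows. This is not the article's code: the function name `lcsBySuffixArray` is made up for illustration, the suffix array is built naively by sorting (fine for short inputs, far from the efficient constructions), and the separator byte `'\x01'` is assumed not to occur in either input. The sketch relies on the standard observation that the longest common substring shows up as the longest shared prefix of two neighbouring suffixes in sorted order that start in different input strings.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper; assumes '\x01' occurs in neither a nor b.
std::string lcsBySuffixArray(const std::string& a, const std::string& b) {
    const std::string s = a + '\x01' + b;

    // Naive suffix array: sort the start positions by comparing the suffixes.
    std::vector<std::size_t> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t p, std::size_t q) {
        return s.compare(p, std::string::npos, s, q, std::string::npos) < 0;
    });

    std::size_t bestLen = 0, bestPos = 0;
    for (std::size_t r = 1; r < sa.size(); ++r) {
        const std::size_t p = sa[r - 1], q = sa[r];
        // Only neighbours that start in different input strings matter.
        if ((p < a.size()) == (q < a.size())) continue;
        // Common prefix length; it cannot run past the unique separator.
        std::size_t len = 0;
        while (p + len < s.size() && q + len < s.size() && s[p + len] == s[q + len]) ++len;
        if (len > bestLen) { bestLen = len; bestPos = p; }
    }
    return s.substr(bestPos, bestLen);
}
```

Calling `lcsBySuffixArray("ABABC", "BABCA")` should give `BABC`. When several substrings tie for the longest length, which one is returned depends on the sort order.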
10

The answer is a generalised suffix tree: http://en.wikipedia.org/wiki/Generalised_suffix_tree

You can build a generalised suffix tree from multiple strings.

Look at this: http://en.wikipedia.org/wiki/Longest_common_substring_problem

The suffix tree can be built in O(n) time for each string of length n, so O(k*n) in total, where k is the number of strings.

So this problem can be solved very quickly.
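A generalised suffix tree is lengthy to implement, so here is a much simpler stand-in for the "at least K of the N strings" variant from the question. It is not a suffix tree and is considerably slower and more memory-hungry: it binary-searches on the answer length and, for each candidate length, counts how many strings contain each substring of that length. The binary search is valid because any prefix of a substring shared by K strings is also shared by those K strings. The names `commonOfLength` and `longestCommonSubstringK` are made up for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Returns some substring of length `len` that occurs in at least k of the
// strings, or an empty string if there is none (or if len == 0).
static std::string commonOfLength(const std::vector<std::string>& strs,
                                  std::size_t len, std::size_t k) {
    if (len == 0) return {};
    std::unordered_map<std::string, std::size_t> count;  // substring -> number of strings containing it
    for (const std::string& s : strs) {
        if (s.size() < len) continue;
        std::unordered_set<std::string> seen;  // count each substring once per string
        for (std::size_t i = 0; i + len <= s.size(); ++i) {
            std::string sub = s.substr(i, len);
            if (seen.insert(sub).second && ++count[sub] >= k) return sub;
        }
    }
    return {};
}

// Longest substring shared by at least k of the given strings.
std::string longestCommonSubstringK(const std::vector<std::string>& strs, std::size_t k) {
    if (k == 0 || k > strs.size()) return {};
    std::size_t lo = 0, hi = 0;
    for (const std::string& s : strs) hi = std::max(hi, s.size());
    std::string best;
    // Invariant: a substring of length lo shared by k strings exists (trivially for lo == 0).
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo + 1) / 2;
        std::string cand = commonOfLength(strs, mid, k);
        if (!cand.empty()) { best = cand; lo = mid; }
        else               { hi = mid - 1; }
    }
    return best;
}
```

With the strings `"123asdfg"`, `"asdfg123"` and `"123"`, `k = 2` should give `asdfg` and `k = 3` should give `123`.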

Lxcypp
4

This is a dynamic programming problem and can be solved in O(mn) time, where m is the length of one string and n is the length of the other.

Like any other problem solved using dynamic programming, we divide the problem into subproblems. Say the two strings are x1x2x3...xm and y1y2y3...yn.

If S(i,j) is the length of the longest common substring of x1x2x3...xi and y1y2y3...yj, then

S(i,j) = max { length of the longest common substring ending at xi/yj (0 if x[i] != y[j]), S(i-1, j-1), S(i, j-1), S(i-1, j) }

Here is a working program in Java. I am sure you can convert it to C++:

public class LongestCommonSubstring {

    public static void main(String[] args) {
        String str1 = "abcdefgijkl";
        String str2 = "mnopabgijkw";
        System.out.println(getLongestCommonSubstring(str1,str2));
    }

    public static String getLongestCommonSubstring(String str1, String str2) {
        //Note: longest[][] is the standard auxiliary table used in the dynamic
        //programming approach to save the results of subproblems.
        //These results are then used to calculate the results for bigger problems
        int[][] longest = new int[str2.length() + 1][str1.length() + 1];
        int min_index = 0, max_index = 0;

        //When one string is of zero length, then longest common substring length is 0
        for(int idx = 0; idx < str1.length() + 1; idx++) {
            longest[0][idx] = 0;
        }

        for(int idx = 0; idx < str2.length() + 1; idx++) {
            longest[idx][0] = 0;
        }

        for(int i = 0; i <  str2.length(); i++) {
            for(int j = 0; j < str1.length(); j++) {

                int tmp_min = j, tmp_max = j, tmp_offset = 0;

                if(str2.charAt(i) == str1.charAt(j)) {
                    //Find length of longest common substring ending at i/j
                    while(tmp_offset <= i && tmp_offset <= j &&
                            str2.charAt(i - tmp_offset) == str1.charAt(j - tmp_offset)) {

                        tmp_min--;
                        tmp_offset++;

                    }
                }
                //tmp_min will at this moment contain either < i,j value or the index that does not match
                //So increment it to the index that matches.
                tmp_min++;

                //Length of longest common substring ending at i/j
                int length = tmp_max - tmp_min + 1;
                //Find the longest between S(i-1,j), S(i-1,j-1), S(i, j-1)
                int tmp_max_length = Math.max(longest[i][j], Math.max(longest[i+1][j], longest[i][j+1]));

                if(length > tmp_max_length) {
                    min_index = tmp_min;
                    max_index = tmp_max;
                    longest[i+1][j+1] = length;
                } else {
                    longest[i+1][j+1] = tmp_max_length;
                }


            }
        }

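        //longest[str2.length()][str1.length()] holds the overall best length; 0 means nothing in common.
        //max_index is inclusive, so the end of the range is max_index + 1 (the last character must not be dropped).
        return longest[str2.length()][str1.length()] == 0 ? "" : str1.substring(min_index, max_index + 1);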
    }
}
davidbuzatto
snegi
3

There is a very elegant Dynamic Programming solution to this.

Let LCSuff[i][j] be the length of the longest common suffix of X[1..i] and Y[1..j]. We have two cases here:

  • X[i] == Y[j], which means we can extend the longest common suffix of X[1..i-1] and Y[1..j-1]. Thus LCSuff[i][j] = LCSuff[i-1][j-1] + 1 in this case.

  • X[i] != Y[j]; since the last characters differ, X[1..i] and Y[1..j] can't have a non-empty common suffix. Hence LCSuff[i][j] = 0 in this case.

We now need to take the maximum of these longest common suffixes.

So, LCSubstr(X,Y) = max(LCSuff(i,j)), where 1<=i<=m and 1<=j<=n

The algorithm pretty much writes itself now.

#include <string>
#include <vector>
using std::string;

string LCSubstr(string x, string y){
    int m = x.length(), n=y.length();

    // (m+1) x (n+1) table; row 0 and column 0 stand for the empty prefix
    // and are zero-initialised by the vector constructor.
    std::vector<std::vector<int>> LCSuff(m+1, std::vector<int>(n+1, 0));

    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(x[i-1] == y[j-1])
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1;
            else
                LCSuff[i][j] = 0;
        }
    }

    // Pick the longest suffix found; it ends at x[i-1] and has length LCSuff[i][j].
    string longest = "";
    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(LCSuff[i][j] > (int)longest.length())
                longest = x.substr(i - LCSuff[i][j], LCSuff[i][j]);
        }
    }
    return longest;
}
Ankesh Anand
  • Can you explain the last part where you find out the common substring? – SexyBeast Mar 25 '16 at 21:15
  • @SexyBeast We know that `LCSuff[i][j]` gives the length of the longest common suffix ending at index `i-1` for string `x` and ending at index `j-1` for string `y`. So finding the common substring is just a matter of taking the `LCSuff[i][j]` characters that end at that index from either of the strings. The answer above chooses to use the first string `x` to that effect. – Quirk Mar 29 '17 at 12:18
  • 1
    There's a bug in the code above, but edit is not possible (due to rules on min possible size of edit). LCSuff should be of size (m+1, n+1) not (m, n). The bottom part could also easily be improved by keeping track of the current max substring length and the start/end of the substring in one of the strings. The string can then be extracted as e.g. x.substr(start_substr, len_substr). – recodeFuture Apr 17 '18 at 20:31
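The improvement suggested in the last comment can be sketched as below. Since each cell only depends on the previous row, this version also keeps just two rows instead of the full table; that memory saving is an addition of this sketch, not something the answer or the comment proposes. The function name `lcSubstrTwoRows` is hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Same recurrence as LCSuff, but only the previous row is kept and the best
// match is tracked while the table is filled.
std::string lcSubstrTwoRows(const std::string& x, const std::string& y) {
    const std::size_t m = x.size(), n = y.size();
    std::vector<std::size_t> prev(n + 1, 0), cur(n + 1, 0);
    std::size_t bestLen = 0, bestEnd = 0;  // best match is x[bestEnd - bestLen, bestEnd)

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            if (x[i - 1] == y[j - 1]) {
                cur[j] = prev[j - 1] + 1;  // extend the common suffix
                if (cur[j] > bestLen) { bestLen = cur[j]; bestEnd = i; }
            } else {
                cur[j] = 0;                // last characters differ
            }
        }
        std::swap(prev, cur);              // cur[0] is never written, so it stays 0
    }
    return x.substr(bestEnd - bestLen, bestLen);
}
```

On ties, the strict `>` comparison keeps the match that ends earliest in `x`, so `lcSubstrTwoRows("ABCXYZ", "XYZABC")` should return `ABC`.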
0

Here is a C# version that finds the longest common substring(s) of two strings using dynamic programming (you may refer to http://codingworkout.blogspot.com/2014/07/longest-common-substring.html for more details):

class LCSubstring
        {
            public int Length = 0;
            public List<Tuple<int, int>> indices = new List<Tuple<int, int>>();
        }
        public string[] LongestCommonSubStrings(string A, string B)
        {
            int[][] DP_LCSuffix_Cache = new int[A.Length+1][];
            for (int i = 0; i <= A.Length; i++)
            {
                DP_LCSuffix_Cache[i] = new int[B.Length + 1];
            }
            LCSubstring lcsSubstring = new LCSubstring();
            for (int i = 1; i <= A.Length; i++)
            {
                for (int j = 1; j <= B.Length; j++)
                {
                    //LCSuffix(Xi, Yj) = 0 if Xi != Yj
                    //                 = LCSuffix(Xi-1, Yj-1) + 1 if Xi = Yj
                    if (A[i - 1] == B[j - 1])
                    {
                        int lcSuffix = 1 + DP_LCSuffix_Cache[i - 1][j - 1];
                        DP_LCSuffix_Cache[i][j] = lcSuffix;
                        if (lcSuffix > lcsSubstring.Length)
                        {
                            lcsSubstring.Length = lcSuffix;
                            lcsSubstring.indices.Clear();
                            var t = new Tuple<int, int>(i, j);
                            lcsSubstring.indices.Add(t);
                        }
                        else if(lcSuffix == lcsSubstring.Length)
                        {
                            //may be more than one longest common substring
                            lcsSubstring.indices.Add(new Tuple<int, int>(i, j));
                        }
                    }
                    else
                    {
                        DP_LCSuffix_Cache[i][j] = 0;
                    }
                }
            }
            if(lcsSubstring.Length > 0)
            {
                List<string> substrings = new List<string>();
                foreach(Tuple<int, int> indices in lcsSubstring.indices)
                {
                    string s = string.Empty;
                    int i = indices.Item1 - lcsSubstring.Length;
                    int j = indices.Item2 - lcsSubstring.Length;
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == 0);
                    for(int l =0; l<lcsSubstring.Length;l++)
                    {
                        s += A[i];
                        Assert.IsTrue(A[i] == B[j]);
                        i++;
                        j++;
                    }
                    Assert.IsTrue(i == indices.Item1);
                    Assert.IsTrue(j == indices.Item2);
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == lcsSubstring.Length);
                    substrings.Add(s);
                }
                return substrings.ToArray();
            }
            return new string[0];
        }

Where unit tests are:

[TestMethod]
        public void LCSubstringTests()
        {
            string A = "ABABC", B = "BABCA";
            string[] substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 1);
            Assert.IsTrue(substrings[0] == "BABC");
            A = "ABCXYZ"; B = "XYZABC";
            substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 2);
            Assert.IsTrue(substrings.Any(s => s == "ABC"));
            Assert.IsTrue(substrings.Any(s => s == "XYZ"));
            A = "ABC"; B = "UVWXYZ";
            string substring = "";
            for(int i =1;i<=10;i++)
            {
                A += i;
                B += i;
                substring += i;
                substrings = this.LongestCommonSubStrings(A, B);
                Assert.IsTrue(substrings.Length == 1);
                Assert.IsTrue(substrings[0] == substring);
            }
        }
Dreamer
0

I tried several different solutions for this, but they all seemed really slow, so I came up with the code below. I didn't test it much, but it seems to run a bit faster for me.

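#include <iostream>
#include <string>

std::string lcs( const std::string& a, const std::string& b )
{
    if( a.empty() || b.empty() ) return {} ;

    std::string current_lcs = "";

    for(size_t i=0; i< a.length(); i++) {
        size_t fpos = b.find(a[i], 0);
        while(fpos != std::string::npos) {
            // Start a candidate at every occurrence of a[i] in b.
            std::string tmp_lcs = "";
            tmp_lcs += a[i];
            // A single matching character is already a common substring.
            if (tmp_lcs.length() > current_lcs.length()) {
                current_lcs = tmp_lcs;
            }
            // Extend the candidate along b while it still occurs somewhere in a.
            for (size_t x = fpos+1; x < b.length(); x++) {
                tmp_lcs+=b[x];
                if (a.find(tmp_lcs, 0) == std::string::npos) {
                    break;
                }
                if (tmp_lcs.length() > current_lcs.length()) {
                    current_lcs = tmp_lcs;
                }
            }
            fpos = b.find(a[i], fpos+1);
        }
    }
    return current_lcs;
}

int main(int argc, char** argv)
{
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <string1> <string2>" << std::endl;
        return 1;
    }
    std::cout << lcs(argv[1], argv[2]) << std::endl;
}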
-6

Find the largest substring from all strings under consideration. From N strings, you'll have N substrings. Choose the largest of those N.

Iceman