3

I am trying to implement modified version of Longest Common Substring where i do not want to split the words.

eg:

String s1 = "hi there how are you";
String s2 = "hi there how ar";

So the output should be: o/p: hi there how.

I am first calculating the Longest common Substring and then filtering any words that are split and below is the code for the same:

private static void longestCommonSubstring(String S1, String S2) {
        int Start = 0;
        int Max = 0;
        for (int i = 0; i < S1.length(); i++) {
            for (int j = 0; j < S2.length(); j++) {
                int x = 0;
                while (S1.charAt(i + x) == S2.charAt(j + x)) {
                    x++;
                    if (((i + x) >= S1.length()) || ((j + x) >= S2.length()))
                        break;
                }
                if (x > Max) {
                    Max = x;
                    Start = i;
                }
            }
        }
        System.out.println("S1 " + S1 + " S2 " + S2 + " " + Start + " " + Max);
        System.out.println("ans " + S1.substring(Start, (Start+Max)));


        if(Start != 0){
            if((S1.charAt(Start-1) == ' ')){
                if(Start+Max != S1.length()){
                    if((S1.charAt(Start+Max) == ' ')){
                        System.out.println(S1.substring(Start, (Start + Max)));
                    }
                }
            }
            else if((S1.charAt(Start+Max) == ' ')){
                int index = S1.indexOf(' ', Start);
                System.out.println(S1.substring(index+1, (Start + Max)));
            }
            else{
                int index = S1.indexOf(' ', Start);
                int lastIndex = S1.lastIndexOf(' ', (Start+Max));
                if(index+1 != lastIndex+1){
                    System.out.println(S1.substring(index+1, lastIndex));
                }
                else{
                    System.out.println(" ");
                }
            }
        }
        else if(Start == 0 && Start+Max != S1.length() && (S1.charAt(Start+Max) == ' ')){
            System.out.println(S1.substring(Start, (Start + Max)));
        }
        else if(Start+Max != S1.length()){
            int index = S1.lastIndexOf(' ', (Start+Max));
            System.out.println(S1.substring(Start, index));
        }
        else{
            System.out.println(S1.substring(Start, (Start + Max)));
        }


    }

But it is failing for the cases like:

String s1 = "hi there how are you";
String s2 = "i ere ho ar you";

where the output should have been "you" but is blank, because the longest common Substring is "ere ho".

Thanks for the help.

Wim Deblauwe
  • 25,113
  • 20
  • 133
  • 211
bna
  • 35
  • 3
  • 2
    It does not fail, it _does_ exactly what you have coded so far. If you only want complete words, then check for complete words and not only `charAt`. – Murat Karagöz Jul 07 '15 at 15:03
  • a example at wikipedia looks more clear than your example, so just sharing it https://en.wikipedia.org/wiki/Longest_common_substring_problem#Example – Srinath Ganesh Jul 07 '15 at 15:13
  • possible duplicate of [Java implementation for longest common substring of n strings](http://stackoverflow.com/questions/17150311/java-implementation-for-longest-common-substring-of-n-strings) – austin wernli Jul 07 '15 at 15:16

3 Answers3

2

Based on karthik, and after bna's comment that any character-based answer was flawed:

public static ArrayList<String> stringToTokens(String s) {
    ArrayList<String> al = new ArrayList<String>();
    StringBuilder word = new StringBuilder();
    boolean inWord = !s.isEmpty() && Character.isLetter(s.charAt(0));
    for (int i=0; i<s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isLetter(c)) {
            word.append(c);
        } else if (inWord) {
            if (word.length() > 0) {
                al.add(word.toString());
                word.setLength(0);
            }
            al.add(""+c);
        } else {
            al.add(""+c);
        }
    }
    if (word.length() > 0) {
        al.add(word.toString());
    }
    return al;
}

public static String longestCommonWithWords(String sa, String sb) {
    ArrayList<String> a = stringToTokens(sa);
    ArrayList<String> b = stringToTokens(sb);
    int m[][] = new int[a.size()][b.size()];
    int bestLength = 0;
    HashSet<Integer> bestSet = new HashSet<Integer>();
    for (int i=0; i<a.size(); i++) {
        for (int j=0; j<b.size(); j++) {
            if (a.get(i).equals(b.get(j))) {
                m[i][j] = (i==0 || j==0) ? 1 : m[i-1][j-1]+1;
                if (m[i][j] > bestLength) {
                    bestLength = m[i][j];
                    bestSet.clear();
                    bestSet.add(i);
                } else if (m[i][j] == bestLength) {
                    bestSet.add(i);
                }
            } else {
                m[i][j] = 0;
            }
        }
    }
    // return first result (or empty string if none)
    StringBuilder result = new StringBuilder();
    if (bestLength > 0) {
        int end = bestSet.iterator().next();
        int start = end - bestLength;
        for (int i=start; i<end; i++) {
            result.append(a.get(i+1));
        }
    }
    return result.toString();
}

Tokenization is straightforward (a StringTokenizer would have worked too, but this version handles strange characters better). The LCS algorithm comes straight from the pseudocode in https://en.wikipedia.org/wiki/Longest_common_substring_problem#Pseudocode

tucuxi
  • 17,561
  • 2
  • 43
  • 74
0

What you want to do is split the text into words and then compare full words, not single letters.

Dakkaron
  • 5,930
  • 2
  • 36
  • 51
  • @austinwernli "without splitting words" doesn't mean without splitting the String into words, it means comparing full words. Honestly, it's going to be pretty straightforward to implement by splitting the original string into an array of strings with complete words. – crigore Jul 07 '15 at 15:18
  • @austinwernli: Did you read the question? The OP wants to find the longest common substring without splitting words. Meaning, he doesn't want any partial words in the match. No partial words = only full words. Splitting the string doesn't mean splitting words. More specific, I said `split the text into words` => do not split words, `and then compare full words` => don't compare partial words. – Dakkaron Jul 07 '15 at 15:32
  • maybe @bna should clarify how he defines "i do not want to split the words" rather than others debate over it . – Srinath Ganesh Jul 07 '15 at 15:33
  • @SrinathGanesh I think it's fairly clear from the example given in bna's question what is meant. I agree that "without splitting words" is unfortunate phrasing, but I don't have a better suggestion right now. – ReyCharles Jul 07 '15 at 15:38
  • 1
    Sorry for the confusion: @Dakkaron is correct. No partial words in the answer – bna Jul 07 '15 at 15:55
0

This will help finding the length, I think finding the actual string is also not that difficult.

Create arrays of words with your inputs

String s1 = "hi there how are you" => X[] ={"hi","there","how","are","you"}

String s1 = "hi there how are you" => Y[] ={"hi","there","how","ar"}

Now find LCS on your arrays

     LCS(X, Y, m, n) = LCS(X, Y, m-1, n-1) + 1 if X[m-1].equals(Y[n-1])
                       0  Otherwise (if !X[m-1].equals(Y[n-1]))
Karthik
  • 4,950
  • 6
  • 35
  • 65