Binary search to find longest common prefix

Question

For an school assignment, we are implementing suffixarray, with the methods of building it and finding the longest common prefix. I manage to build and sort the suffix array quite easily but struggle with the LCP.

I am trying to find the longest common prefix of a pattern string P in another string T, using one singular binary search. The algorithm should return the index of where the longest common prefix begins.

Examples:

If the pattern string P is "racad" and the string T is "abracadabra", the longest common prefix should be "racad", beginning at index 2.

Likewise, if the the pattern string is P "rax" then the longest common prefix should be "ra", beginning at index 2 or 9.

I´ve come quite far but the algorithm is not returning the right value. Here´s my code:

public int compareWithSuffix(int i, String pattern) {
     int c = 0;
     int j = 0;

    while (j < pattern.length() && c == 0) {
        if (i + j <= text.length()) {
        c = pattern.charAt(0 + j) - text.charAt(i + j);
        } else {
            c = 1;
        }
        j++;
    }
    return c;
}

public int binarySearch(String pattern) {
    int left = 0;
    int right = text.length() - 1;
    int mid, c = 0;

    while (c != 0 && left <= right) {
        mid = left + (right - left) / 2;
        c = compareWithSuffix(mid, pattern);

        if (c < 0) {
            right = mid - 1;
        } else if (c > 0) {
            left = mid + 1;
        } else if (c == 0) {
            return mid;
        }
    }
    return left;
}

I run it with this main-method:

public static void main(String[] args) {
    String word = "abracadabra";
    String prefix1 = "rax";
    String prefix2 = "racad";
    SuffixArray s = new SuffixArray(word);

    System.out.println("Longest common prefix of: " + "'" + prefix1 + "'" + " in " + "'" + word + "'" + " begins at index: " + s.binarySearch(prefix1));
    System.out.println("Longest common prefix of: " + "'" + prefix2 + "'" + " in " + "'" + word + "'" + " begins at index: " + s.binarySearch(prefix2));
}

The output is always whatever value I initialize the local variable left with.

The search algorithm must do a singular binary search. I´ve tried searching other stackoverflow-questions and other web-sources but have not found anything helpfull.

Anyone who can see any errors in my code?

Are you actually using your suffix array in the binary search? Because it seems to me like you are not using the entries of the suffix array, only the index `i`. Did you mean `text.charAt(SA[i] + j)` or something like that? — Tobias Ribizel, Dec 16 '17 at 20:14
No I´m not using it. In the beginning, I thought we were supposed to use the suffix array in the binary search, but the professor told us that we use the "original" string (namely abracadabra) and not the suffix array of it. (Which I think is odd - why build a suffix array and then not use it?) `text.charAt(SA[i] + j)` (if SA is the suffix array) would give us a string no? @TobiasRibizel — Isus, Dec 16 '17 at 20:27
@Isus Binary search requires the range you are searching to be sorted. While the text itself is not, the suffixes of the text as listed in the SA are sorted. So you actually don't need to compare the suffix that starts at `mid` with your pattern, but the suffix starting at `SA[mid]`. — Tobias Ribizel, Dec 16 '17 at 20:33
`c` and `left` both equals zero and `right` equals the length of the original text - 1 @AndyTurner — Isus, Dec 16 '17 at 20:34
@Isus I asked what the initial value of the loop condition - `c != 0 && left <= right` - is. — Andy Turner, Dec 16 '17 at 20:35
_Binary search requires the range you are searching to be sorted_ - then it makes even more sense to use the SA. I think I understand. Then I should use the SA in the `compareWithSuffix`-method, not the original text that is used to build the SA @TobiasRibizel — Isus, Dec 17 '17 at 11:47

score 1 · Answer 1 · answered Dec 16 '17 at 20:44

I haven't looked deeply enough to know if this is the only problem in your code, but this immediately jumps out as an explanation for "The output is always whatever value I initialize the local variable left with":

int mid, c = 0;

while (c != 0 && left <= right) {

You set c to zero, and then immediately check if it's not equal to zero. Of course, it's not not equal to zero, so the loop condition is immediately false, thus the loop body is never run. Hence, you will return the initial value of left.

It's not obvious why you are checking c at all. In the only situation where c becomes zero inside the loop, you immediately return. So just change your loop guard to:

while (left <= right) {

(and move the declaration of c inside the loop).

You could easily have found this by stepping through the code with a debugger. I heartily recommend learning how to use one.

score 0 · Answer 2 · answered Dec 19 '17 at 17:08

0

first point: analyzing the examples you gave, it appears that you are not interested in the longest common prefix, but in the longest common substring. A prefix always starts with the first letter of the word - https://en.wikipedia.org/wiki/Prefix

second point: perhaps are you interested in finding the longest common substring of a set of words, or just two words?

public class Main3 {

/*
same functionallity as compareWithSuffix, but i think this name
is more suggestive; also, no need to pass i (the starting index) as a
parameter; i will later apply substring(i) to text      
        */
public String longestCommonPrefix(String text, String pattern)
{
    String commonPrefix="";
    for(int j=0; j<text.length() & j<pattern.length(); j++)
    {
        if(text.charAt(j)==pattern.charAt(j))
        {
            commonPrefix=commonPrefix+text.charAt(j);
        }
        else
        {
            break;
        }
    }

    return commonPrefix;
    //for params "abc", "abd", output will be "ab"
}

public String longestCommonSequence(String s1, String s2)
{
    /*
    take for example "xab" and "yab";in order to find the common
    sequence 'ab", we need to chop both x and y; for this reason
    i apply substring to both strings, cutting progressivelly their first letters       
            */
    String longestCommonSequence="";
    for(int i=0;i<=s1.length()-1;i++)
    {
        for(int j=0;j<=s2.length()-1;j++)
        {
            String s1_temp=s1.substring(i);
            String s2_temp=s2.substring(j);
            String commonSequence_temp=longestCommonPrefix(s1_temp, s2_temp);
            if(commonSequence_temp.length()>longestCommonSequence.length())
                longestCommonSequence=commonSequence_temp;
        }
    }

    return longestCommonSequence; 
}



public static void main(String args[])
{
    Main3 m = new Main3();
    String common = m.longestCommonSequence("111abcd2222", "33abcd444");
    System.out.println(common);//"abcd"

}
}

answered Dec 19 '17 at 17:08

Newton fan 01

484
5
14

_First point:_ Then I have missunderstood the meaning of prefix. It is as you say that I want to find the longest possible common substring. _Second point:_ just two words. The search has to be done by a binary search, and the algorithm should return the position (in the word we search in) where the longest common substring begins @Newton fan 01 – Isus Dec 20 '17 at 10:48
i am just now editing a second answer, providing a solution for more than just two words. even if your current problem doesn't ask for this, perhaps other readers will find the solution interesting – Newton fan 01 Dec 20 '17 at 10:51
just as someone else pointed out, binary search makes sense only for sorted stuff. when searching a pattern inside a string, binary search simply does not make sense – Newton fan 01 Dec 20 '17 at 10:53
The binary searh should be done by using a suffix array (sorted) constructed from the string T. Otherwise it would not make sense if we are just using two strings and nothing else, as you say. – Isus Dec 20 '17 at 10:55
if you need to return the **position** of the common substring,and not the common substring **itself**, then use my algorithm to find that common substring, and later search it inside the larger string: https://stackoverflow.com/questions/2615749/java-method-to-get-position-of-a-match-in-a-string – Newton fan 01 Dec 20 '17 at 10:56
as long as my solution does find the longest common substring, is it the solution you were looking for? or is it mandatory that you use suffix-array data structure? a second point: you wanted to find "common prefix of a pattern string P in another string T" - i don't really understand what you mean. Is it the same thing as finding the common prefix of strings P and T? – Newton fan 01 Dec 20 '17 at 11:14
I wish that would suffice, but no, it´s unfortunately not enough. We **must** use a suffix-array structure **and** use a one singular binary search to find the position of the longest common substring. Otherwise the time complexity won´t be efficient enough, according to our teacher. – Isus Dec 20 '17 at 12:32
Regarding your second point: it is very similair, we have a longer string, let´s say "abracadabra" (we call string T), and we want to find out if a smaller string, let´s say "radac" (we call string P) exists in the longer string T. If it does, we want to return the position of where "radac" begins in the longer string "abracadabra". Does this clarify what I mean? – Isus Dec 20 '17 at 12:36

score 0 · Answer 3 · answered Dec 20 '17 at 10:59

I am providing here a different answer, because the approach is totally different, and leads to a generalized solution. (find the common substring of an entire list of strings)

For each word, i build all its possible substrings. A substring is determined by its start and end index. If the length of a word is L, the start index can be: 0, 1, 2,... L-1; if the starting index is 0, the end index can take values from 1 to L-1, thus L-1 values; if the starting index is 1, there are L-2 possible values for the end index. Thus, a word of length L has (L-1) +(L-2) + ... +1 = L*(L-1)/2 substrings. This gives square complexity in regard to L, but that's not an issue, because words rarely exceed 15 letters in length. If the string is not a word, but a text paragraph, then we have a problem with the square complexity.

Next, after i have build the set of substrings for every word, i build the intersection of these sets. The main idea is that a common substring of more words is, in the first place, a substring inside every such word, and, moreover, a substring that is encountered in all of these words. This lead to the idea of building the set of substrings for every word, and then do the intersection.

After we have found all the common substrings, just iterate and keep the longest one

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class Main4 {

    HashSet<String> allSubstrings(String input)
    {
        HashSet<String> result = new HashSet<String>();
        for(int i=0;i<=input.length()-1;i++)
            for(int j=i+1;j<=input.length();j++)
                result.add(input.substring(i,j));

        return result;
    }

    public HashSet<String> allCommonSubstrings(ArrayList<String> listOfStrings)
    {

        ArrayList<HashSet<String>> listOfSetsOfSubstrings =new ArrayList<HashSet<String>>();
        //for each string in the list, build the set of all its possible substrings
        for(int i=0;i<listOfStrings.size();i++)
        {
            String currentString = listOfStrings.get(i);
            HashSet<String> allSubstrings = allSubstrings(currentString);
            listOfSetsOfSubstrings.add(allSubstrings);
        }

        //find the intersection of all the sets of substrings
        HashSet<String> intersection = new HashSet<String>(listOfSetsOfSubstrings.get(0));
        for(int i=0;i<listOfSetsOfSubstrings.size();i++)
        {
            HashSet<String> currentSet=listOfSetsOfSubstrings.get(i);
            intersection.retainAll(currentSet);
            //retainAll does the set intersection. see: https://stackoverflow.com/questions/8882097/how-to-calculate-the-intersection-of-two-sets

        }

        return intersection;

    }

    public String longestCommonSubstring(HashSet<String> setOfSubstrings)
    {
        if(setOfSubstrings.size()==0)
            return null;//if there are no common substrings, then there is no longest common substrings

        String result="";
        Iterator<String> it = setOfSubstrings.iterator();
        while(it.hasNext())
        {
            String current = it.next();
            if(current.length()>result.length())
                result=current;
        }

        return result;
    }

    public static void main(String[] args)
    {
        Main4 m = new Main4();
        ArrayList<String> list=new ArrayList<String>();
        list.add("bbbaaddd1");
        list.add("bbbaaccc1");
        list.add("dddaaccc1");
        HashSet<String> hset = m.allCommonSubstrings(list);
        Iterator<String> it = hset.iterator();
        System.out.println("all coommon substrings:");
        while(it.hasNext())
        {
            System.out.println(it.next());
        }
        System.out.println("longest common substring:");
        String lcs=m.longestCommonSubstring(hset);
        System.out.println(lcs);
    }
}

output:

all coommon substrings:
aa
a
1
longest common substring:
aa

Binary search to find longest common prefix

3 Answers3