substring calculation in a string

Question

I am having difficulty finding better approach than O(n^2) for the following question.

I am given a string e.g xyxxz.

Now I need to find total number of matching characters in each prefix of the given string.

Here, possible prefixes of string are:

 xyxxz : matching characters is 5
 yxxz  : matching characters is 0 (since 1st character doesnt match)
 xxz   : matching characters is 1
 xz    : matching characters is 1
 z     : matching characters is 0

This should be the output. I did the following code:

cin>>str;
len=str.length();
for(i=0;i<len;i++){
    sum=0;
    k=i;
    for(int j=0;j<len;j++)
    {
        if(str[k] == str[j]){
           sum++;
           k++;
        }
        else
            break;
    }
    cout<<sum<<"  ";  //I get the output 5 0 1 1 0
}

But it's O(n^2). I want a better approach : probably O(n) or O(nlogn).

Thanks in advance.

Pointing out `std::mismatch`, which, IIRC, is linear complexity. — chris, Jul 26 '12 at 04:20
If you want it, [this page](http://en.cppreference.com/w/cpp/algorithm/mismatch) has a possible implementation. That's a good place to start as to where you can improve. — chris, Jul 26 '12 at 04:25
Not that this answers your question, but I don't think you need the `else break;` lines. — Turix, Jul 26 '12 at 04:27
mismatch takes O(n)..And applying mismatch for each possible prefix as aforementioned,it would again be O(n^2).won't it ? — vijay, Jul 26 '12 at 04:32
@Turix it's required because I don't want to traverse next characters until all previous characters match. — vijay, Jul 26 '12 at 04:33
@vjshah, Ah, I didn't read it quite correctly, but I think N^2 is the best you're going to get. The first comment is still valid in saying that you can use it to achieve your goal in less, very-well-tested code. — chris, Jul 26 '12 at 04:40
@vjshah Sorry if I misunderstand, but when you say "prefix", you mean "suffix", is that correct? — jogojapan, Jul 26 '12 at 05:06
@vj_shah ah ha. I see you fixed the brackets around it. Now I get it. — Turix, Jul 26 '12 at 05:06

score 2 · Answer 1 · edited May 23 '17 at 11:48

This can be done in linear time using the following procedure:

Construct the suffix array SA as well as the LCP array (longest-common prefix array). The suffix array is the lexicographically sorted list of all suffixes of the string (similar to the list you gave in the example, but lexicographically sorted. Note that each suffix is represented by its starting position in the original string, i.e. one integer per suffix). The LCP is an also an array of integers, identical in length to the suffix array. At each position i>0, LCP[i] is the longest prefix the ith suffix of the suffix array has in common with the (i-1)th suffix; and we set LCP[0]:=0.

The suffix array can be constructed in linear time using the Skew algorithm (also known as DC algorithm), and it is possible to construct the LCP array alongside the suffix array, still in O(n) time. See an SO post on state-of-the-art suffix array algorithms for further ideas and implementations.
Identify the position of the complete string within the suffix array (e.g. by linearly scanning the suffix array for an entry that contains the integer 0).
Starting from that position walk leftwards and rightwards along the LCP array to identify the longest prefix that each suffix has in common with the complete string. I have described this procedure in detail in this older SO post.

Remark Although this requires no more than O(n) memory and time and is therefore theoretically optimal, it is a very complicated procedure and will only be useful in practice if your string is extremely long.

yeah ..it seems pretty complicated.I do have string length of 1 million. Thanks for your help :) — vijay, Jul 26 '12 at 07:34

Vaughn Cato · Answer 2 · 2012-07-26T05:22:26.890

1

If you build a suffix tree for your string, you can walk the suffix tree to find the lengths of the matches. Any child of the root node that doesn't match the first character will have a value of zero, as will all its descendants. Then all descendants of the child of the root node that does have a match will at least have a match equal to the length of that match for that edge.

edited Jul 26 '12 at 05:22

answered Jul 26 '12 at 04:44

Vaughn Cato

63,448
5
82
132

this is what would help.I need to traverse the suffix tree and get total number of edges to each prefix if I am getting your idea.It would take O(nlogn).Please correct me if I am getting wrong.Thanks – vijay Jul 26 '12 at 04:48
@vjshah: yes, that sounds right. It may be possible to do better than O(n*log(n)), since the tree can be built in O(n) time, but I'm not sure. – Vaughn Cato Jul 26 '12 at 04:52

score 1 · Answer 3 · answered Mar 15 '13 at 05:16

1

Use Suffix array. Suffix array DC3 will do it in O(N) time, where N is the number of characters in original String.

answered Mar 15 '13 at 05:16

linsek · Answer 4 · 2012-07-26T04:48:03.640

int strcoll (register const char *s1, register const char *s2)
{
    while (*s1 == *s2++) {
        if (*s1++ == '\0') {
            return 0;
        }
    }
    return *s1 - *--s2;
}

This function starts comparing the first character of each string. If they are equal to each other continues with the following pair until the characters differ or until a null-character signaling the end of a string is reached.

Returns an integral value indicating the relationship between the strings: A zero value indicates that both strings are equal. A value greater than zero indicates that the first character that does not match has a greater value in str1 than in str2; And a value less than zero indicates the opposite.

But it won't give better solution than O(n^2).You have to do strcoll using string and each prefix of string.So for getting each prefix and then strcoll,it would be O(n^2).samething can be done with strcmp or str.compare(). — vijay, Jul 26 '12 at 04:45

score 0 · Answer 5 · answered Jul 26 '12 at 05:14

0

I am not good at calculating the O value.. but here is another implementation for what you need. I think it is a little more efficient.

cin >> str;

len=str.length();
for(i=0;i<len;i++)
{
    sum=0;
    k=0;

    while(str[k + sum] == str[i + sum])
    {
            sum++;
    }
    cout << sum ;

}

answered Jul 26 '12 at 05:14

ddb

1,416
11
17

while is an infinite loop.And it's again O(n^2) – vijay Jul 26 '12 at 07:20

substring calculation in a string

5 Answers5