
I'm implementing the longest common substring problem with rolling hashes, and I can't understand how it works in O(n log n) time. The algorithm states:

1) Choose a candidate substring length mid by binary search over (low, high).
2) Compute rolling hash values for all substrings of length mid in both strings.
3) Compare the two sets of rolling hashes; if they share a value, move the binary search up, otherwise move it down.

I can see that computing the rolling hashes is O(n) and the binary search over the length is O(log n), giving O(n log n). But from what I understand, the rolling hashes by themselves are not sufficient to get a guaranteed hit: multiple substrings could have the same rolling hash value, so it's another factor of O(n) to check whether the strings really match. Am I missing something?
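For concreteness, here is a minimal sketch of the scheme described above (my own illustration, not from any particular implementation), including the explicit verification step the question is asking about. All names are hypothetical.

```python
def window_hashes(x, k, base=256, mod=(1 << 61) - 1):
    """Yield (hash, start) for every length-k window of x, using a
    polynomial rolling hash updated in O(1) per character."""
    power = pow(base, k, mod)  # base^k, used to drop the outgoing character
    h = 0
    for i, c in enumerate(x):
        h = (h * base + ord(c)) % mod
        if i >= k:
            h = (h - ord(x[i - k]) * power) % mod
        if i >= k - 1:
            yield h, i - k + 1

def common_substring_of_length(s, t, k):
    """Return some common substring of length k, or None.
    Candidate hash matches are verified character-by-character,
    so hash collisions cannot produce a wrong answer."""
    if k == 0:
        return ""
    seen = {}
    for h, i in window_hashes(s, k):
        seen.setdefault(h, []).append(i)
    for h, j in window_hashes(t, k):
        for i in seen.get(h, ()):
            if s[i:i + k] == t[j:j + k]:  # the extra verification step
                return s[i:i + k]
    return None

def longest_common_substring(s, t):
    """Binary-search the answer length; each probe hashes both strings."""
    lo, hi, best = 0, min(len(s), len(t)), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_substring_of_length(s, t, mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best
```

The binary search is valid because the predicate is monotone: if a common substring of length mid exists, so does one of every shorter length. Note that the verification step is exactly the part whose cost the O(n log n) claim glosses over.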

Aks
  • I am not familiar with this algorithm, but I think you are right. There is a class of algorithms, called Monte Carlo algorithms, where a correct result is not 100% guaranteed but performance is better; if you want a 100% correct result, you do the additional checking, which gives the Las Vegas version of the algorithm. – Arsen Mkrtchyan Jan 29 '14 at 10:35

1 Answer


It works in O(n log n) only if you assume that there are no hash collisions. If you need an algorithm with the same or better time complexity and performance that is guaranteed to produce a correct result for all possible inputs, try suffix structures (such as a suffix array, suffix tree, or suffix automaton).
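A common way to make the hash-only (Monte Carlo) version practical, without the extra verification pass, is to compare pairs of hashes taken with two independent moduli, which makes a false match extremely unlikely (though still not impossible). A hedged sketch, with hypothetical names and arbitrary modulus choices:

```python
def double_hashes(x, k, base=131):
    """Return the set of (h1, h2) hash pairs, over two independent
    moduli, for every length-k window of x. A false match now requires
    a simultaneous collision in both moduli."""
    mods = (1_000_000_007, 998_244_353)
    pows = [pow(base, k, m) for m in mods]
    hs = [0, 0]
    out = set()
    for i, c in enumerate(x):
        for j, m in enumerate(mods):
            hs[j] = (hs[j] * base + ord(c)) % m
            if i >= k:
                hs[j] = (hs[j] - ord(x[i - k]) * pows[j]) % m
        if i >= k - 1:
            out.add(tuple(hs))
    return out

def has_common_substring(s, t, k):
    """Monte Carlo check: True if some length-k window hash pair matches.
    Not 100% guaranteed, but the collision probability is tiny."""
    return k == 0 or bool(double_hashes(s, k) & double_hashes(t, k))
```

This keeps each binary-search probe at O(n), at the cost of a vanishingly small error probability; only suffix structures remove that caveat entirely.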

Edit: here is a nice explanation of the suffix tree algorithm: Stack Overflow answer to a question on suffix trees. Some other useful resources are here, and @makagonov has a nice implementation of a suffix tree in C++ here.
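To illustrate the suffix-structure route, here is a sketch of the suffix-automaton approach (my own illustration, not the linked @makagonov code): build the automaton of one string in O(n), then stream the other string through it while tracking the longest match ending at each position. The whole thing is O(n + m) with no hashing and no error probability.

```python
def lcs_via_suffix_automaton(s, t):
    """Longest common substring of s and t via a suffix automaton of s."""
    # Build the suffix automaton of s (states: suffix link, length, transitions).
    link, length, trans = [-1], [0], [{}]
    last = 0
    for ch in s:
        cur = len(length)
        length.append(length[last] + 1); link.append(-1); trans.append({})
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Clone q so that lengths along the suffix-link path stay valid.
                clone = len(length)
                length.append(length[p] + 1)
                link.append(link[q])
                trans.append(dict(trans[q]))
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # Walk t through the automaton, tracking the longest match ending here.
    v, l, best, best_end = 0, 0, 0, 0
    for i, ch in enumerate(t):
        while v != 0 and ch not in trans[v]:
            v = link[v]          # shorten the current match via suffix links
            l = length[v]
        if ch in trans[v]:
            v = trans[v][ch]
            l += 1
        if l > best:
            best, best_end = l, i + 1
    return t[best_end - best:best_end]
```

Unlike the hashing approach, there is no binary search here: the automaton answers "longest match ending at this position of t" directly for every position in one pass.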

Waqar Rashid
kraskevich