I'm trying to understand why the worst case running time of the Rabin-Karp algorithm is O(nm) and the average case is O(n+m).
Can someone help me with that?
I'm trying to understand why the worst case running time of the Rabin-Karp algorithm is O(nm) and the average case is O(n+m).
Can someone help me with that?
Wiki explains about the time complexity of the algorithm pretty well.
It can be said that the effectiveness (read it as ability to dynamically reuse already calculated hash value in constant time) of the hash computation function is a deciding factor during the calculation of time complexity of the algorithm.
Let us see how the hash computation makes this difference.
Time complexity is O(nm)
for cases when :
call hash(s[1..m]) // O(m) additive
for index from 1 to n-m+1 // O(n)
//Code to check if substring matches
call hash(s[index+1..index+m]) // Inefficient hash function, takes O(m), just like naive string matching
Compared to O(nm)
, additive O(m)
is largely ignored.
Giving, O(m) + O(n)*O(m)
= O(nm)
Time complexity is O(n+m)
for cases when :
call hash(s[1..m]) // O(m) additive
for index from 1 to n-m+1 // O(n)
//Code to check if substring matches
call hash(s[index+1..index+m]) //Efficient hash function which takes only O(1), applies Rolling Hashing
Giving, O(m) + O(n)*O(1)
= O(m) + O(n)
= O(m+n)
Rabin-Karp is worst-case O(nm) because it could find a false positive at every point (of which there are n
) and it can require up to m
comparisons to verify the match, since you need to actually compare the strings.
With an even half-reasonable hash function that shouldn't ever happen, but for pretty well any hash function it is possible to craft a query (that is, both the string and the substring being searched for) which exhibit the above pathological behaviour.
Hence, although R-K has expected time complexity of O(n), it's worst case has time complexity O(nm). (Note: since m
must be no greater than n
, n + m
is bounded by 2n
and thus O(n + m) is the same as O(n).)
It is easier to produce O(nm) behaviour if the problem is to find all the matching substrings, which is another context in which R-K is often used. In that case, searching for a substring consisting of m
as in a string consisting of n
as would definitely take nm
time, since the substring needs to be matched at every point in the source string.
Other algorithms exist for finding all substrings which are still linear in n, even in pathological cases.