I recently came across the KMP algorithm, and I have spent a lot of time trying to understand why it works. While I do understand the basic functionality now, I simply fail to understand the runtime computations.
I have taken the code below from the GeeksforGeeks site: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
The site claims that for a text of length n and a pattern of length m, the KMP search runs in O(n) time in the worst case. It also states that the LPS array can be computed in O(m) time.
// C++ program for implementation of KMP pattern searching
// algorithm
#include <cstdio>
#include <cstring>
#include <vector>
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
    int M = strlen(pat);
    int N = strlen(txt);
    // create lps[] that will hold the longest prefix suffix
    // values for pattern (std::vector rather than a
    // variable-length array, which is not standard C++)
    std::vector<int> lps(M);
    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps.data());
    int i = 0; // index for txt[]
    int j = 0; // index for pat[]
    while (i < N) {
        if (pat[j] == txt[i]) {
            j++;
            i++;
        }
        if (j == M) {
            printf("Found pattern at index %d ", i - j);
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
}
// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
    // length of the previous longest prefix suffix
    int len = 0;
    lps[0] = 0; // lps[0] is always 0
    // the loop calculates lps[i] for i = 1 to M-1
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0) {
                len = lps[len - 1];
                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}
// Driver program to test above function
int main()
{
    char txt[] = "ABABDABACDABABCABAB";
    char pat[] = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}
I am really confused why that is the case.
For the LPS computation, consider the pattern aaaaacaaac. When we compute the LPS value for the first c, len keeps falling back (through lps[3], lps[2], lps[1], lps[0]) until it reaches 0 and stops. So we step back through nearly the whole prefix processed up to that point. If this happens multiple times, how can the total time still be O(m)?
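One way to test this concern empirically is to count the steps. The following is a minimal sketch, not part of the GeeksforGeeks code: `LpsStats` and `countLpsSteps` are hypothetical names introduced here, and the loop is assumed to mirror `computeLPSArray` exactly, with two counters added. The assertion at the end encodes the amortized observation that `len` grows by at most 1 per advance of `i` and shrinks by at least 1 per fallback, so fallbacks cannot outnumber advances.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper type: the LPS table plus step counters.
struct LpsStats {
    std::vector<int> lps;
    int advances = 0;   // iterations that increment i
    int fallbacks = 0;  // iterations that run len = lps[len - 1]
};

// Same control flow as computeLPSArray, instrumented with counters.
LpsStats countLpsSteps(const std::string& pat) {
    LpsStats s;
    const int M = static_cast<int>(pat.size());
    s.lps.assign(M, 0);
    int len = 0;  // length of the previous longest prefix suffix
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            ++len;
            s.lps[i] = len;
            ++i;
            ++s.advances;
        } else if (len != 0) {
            len = s.lps[len - 1];  // i is not incremented here
            ++s.fallbacks;
        } else {
            s.lps[i] = 0;
            ++i;
            ++s.advances;
        }
    }
    // len rises only on an advance and drops on every fallback,
    // so the total number of fallbacks is bounded by the advances.
    assert(s.fallbacks <= s.advances);
    return s;
}
```

For the pattern aaaaacaaac (m = 10), `countLpsSteps` reports 9 advances and 7 fallbacks: 4 fallbacks at the first c and only 3 at the second, because `len` had only grown back to 3 in between. That is 16 loop iterations in total, under the 2(m - 1) = 18 ceiling.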
I am similarly confused about why the search itself runs in O(n).
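For the search, a rough way to probe the O(n) claim is to count iterations of the main loop. This is a sketch under assumptions: `SearchStats`, `buildLps`, and `countSearchSteps` are hypothetical names, and the loop is meant to follow `KMPSearch` step for step, collecting match positions instead of printing them. Every iteration either increments `i` or strictly decreases `j`, and `j` only grows together with `i`, which is why the assertion checks for at most 2n iterations.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper type: match positions plus an iteration counter.
struct SearchStats {
    std::vector<int> matches;
    long iterations = 0;  // iterations of the main while loop
};

// LPS table, same logic as computeLPSArray.
static std::vector<int> buildLps(const std::string& pat) {
    std::vector<int> lps(pat.size(), 0);
    int len = 0;
    std::size_t i = 1;
    while (i < pat.size()) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;
        } else if (len != 0) {
            len = lps[len - 1];
        } else {
            lps[i++] = 0;
        }
    }
    return lps;
}

// Same control flow as KMPSearch, instrumented with a counter.
SearchStats countSearchSteps(const std::string& pat, const std::string& txt) {
    SearchStats s;
    const int M = static_cast<int>(pat.size());
    const int N = static_cast<int>(txt.size());
    if (M == 0) return s;  // assumption: an empty pattern reports nothing
    const std::vector<int> lps = buildLps(pat);
    int i = 0;  // index into txt
    int j = 0;  // index into pat
    while (i < N) {
        ++s.iterations;
        if (pat[j] == txt[i]) {
            ++j;
            ++i;
        }
        if (j == M) {
            s.matches.push_back(i - j);
            j = lps[j - 1];
        } else if (i < N && pat[j] != txt[i]) {
            if (j != 0)
                j = lps[j - 1];  // fallback: j strictly decreases
            else
                ++i;
        }
    }
    // i increases at most N times; j decreases at most as often as it grew.
    assert(s.iterations <= 2L * N);
    return s;
}
```

With the driver's inputs, `countSearchSteps("ABABCABAB", "ABABDABACDABABCABAB")` finds the single match at index 10, and the iteration count stays within 2n = 38.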
I have read other threads on Stack Overflow before posting, as well as various other sites on the topic, and I am still very confused. I would really appreciate it if someone could help me understand the best-case and worst-case scenarios for these algorithms, and how their runtime is derived, using some examples. Again, please don't suggest that I google this. I have done so, spent a whole week trying to gain some insight, and failed.