1

I have Rabin Karp implementation in C++ (Rabin Karp is a string pattern matching algorithm that uses hashing technique to match substrings [Wiki link to the algorithm] (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)), A trivial test case fails, I suspect precomputation of substring hashes routine is doing something wrong, because I tried replacing the precomputed value with the PolynomialHash function call ie. computing the hash of the substring in flight. I get the correct result with PolynomialHash function call but precomputed doesn't. What might have I done wrong here?

Algorithm pseudo code

1) Choose a large Prime p. 
2) Let x be a random value in range [1, p-1]. 
3) Get the hash value for the pattern (p and x are used here).
4) For each substring which has length equal to the pattern compute the hash value of this substring. 
5) If the hash values of pattern and the substring match then this substring could be a potential match, to ensure, check whether they are in fact the same. 

The reason a good implementation should avoid using PolynomialHash is because the time complexity goes high, instead precomputing the Hash values for each substring is more efficient. So, my problem is the precomputed hashes, and the function call hash values don't match. Below is the relevant code and the failing testcase. Note the large prime I used is 1e9 + 7. ll is typedef for long long int.

The equation used for precomputation

bool AreEqual(const string &text,const string &pattern,int si,int ei){
    for(int i =0;i<pattern.size();i++){
        if(pattern[i]!=text[si+i]) return false; 
    }
    return true;
}

ll PolynomialHash(const string &str,ll prime,ll x,int si,int ei){
    ll hash_val = 0; 
    for(int i = ei;i>=si;i--) hash_val = (hash_val*x+str[i])%prime;
    return hash_val;
} 


ll GetInt(const char &x){
    return x-'a';
}

void PrecomputeSubstringHashes(vector<ll> &substring_hashes,const string &text,const string &pattern,ll prime,ll x){
    int text_len = text.length(); 
    int pattern_len = pattern.length(); 
    substring_hashes[text_len-pattern_len] = PolynomialHash(text,prime,x,text_len-pattern_len,text_len-1); 
    ll y = 1;
    for(int i =0;i<pattern_len;i++) y = (y*x)%prime; // can still be optimized to O(log(|P|)) 
    for(int i = text_len-pattern_len-1;i>=0;i--){
        substring_hashes[i] = (prime + substring_hashes[i+1]*x + GetInt(text[i]) - y*GetInt(text[i+pattern_len]))%prime;
    }
}

void RabinKarpSubstringSearch(const string &text,const string &pattern){
    vector<int>positions; 
    ll prime = Mod; // This large prime used is 1e9 + 7
    ll x = 1 + rand()%prime; 
    ll pattern_hash = PolynomialHash(pattern,prime,x,0,pattern.length()-1);  
    assert(text.length()>=pattern.length());
    vector<ll>substring_hashes(text.length()-pattern.length()+1,0);  
    PrecomputeSubstringHashes(substring_hashes,text,pattern,prime,x);
    for(int i =0;i<=text.length()-pattern.length();i++){
        ll substr_hash = substring_hashes[i]; // Try with PolynomialHash it gives correct results
        if(substr_hash!=pattern_hash) continue;  
       // cout<<"Hitting at "<<i<<' '; // checking if expected and more indices [the wrong ones] hit this line.
        if(AreEqual(text,pattern,i,i+pattern.length()-1)) positions.push_back(i); /// The probablity of this condition being hit is too less thanks to a large prime number.
    }  
    cout<<'\n';
    for(const int &x : positions) cout<<x<<' ';
}
// Test case 

aab 
abxaabbbaaab 

PolynomialHash gives [3,9] as expected 
substring_hashes[i] gives only [9] 

I Did some further analysis I was right to suspect the Hash returned by Function and the precomputed values don't match. Here are both of them with input same as in the testcase. The Pattern Hash value is 850147133 seen at index 3 and 9 in Function hash.

// Function computed Hashes [index,HashValue]
0 45760133
1 921924329
2 423268806
3 **850147133**
4 654436503
5 654436504
6 227558154
7 423268784
8 423268783
9 **850147133**
// Precomputed Hashes [index,HashValue] 
0 317287043
1 159209231
2 90628189
3 807191256
4 327387316
5 934623271
6 740231484
7 428758105
8 399604110
9 **850147133**
Pawan Nirpal
  • 565
  • 1
  • 12
  • 29
  • `ll` -- What is this? – PaulMcKenzie Dec 07 '22 at 08:31
  • *I Did some further analysis I was right to suspect the Hash returned by Function and the precomputed values don't match.* -- [What is a debugger?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) – PaulMcKenzie Dec 07 '22 at 08:35
  • @PaulMcKenzie Hi I see that it has been mentioned in the desc, sorry for the unreadablity of the code, It is a typedef for long long int. – Pawan Nirpal Dec 07 '22 at 08:44
  • @PaulMcKenzie Thanks for the suggestion I'm aware of debuggers, I wanted to know any logical flaw or modulo maths could have missed! – Pawan Nirpal Dec 07 '22 at 08:46
  • 1
    Guessing you've bumped into C++'s non-Euclidean mod operator, like many before you. If I'm right, changing `- y*GetInt(...)` to `+ (prime-y)*GetInt(...)` should fix it. – David Eisenstat Dec 07 '22 at 13:32
  • @DavidEisenstat That did it, thanks a lot. I have never heard of non-Euclidean mod operator, will learn about it, thanks for pointing out. – Pawan Nirpal Dec 08 '22 at 05:49

0 Answers0