Rolling hash overflow/negative result protection

Question

This question is very similar to rolling-hash, but there are some specifics regarding overflow/negative result which are still not clear to me.

I have as well checked out this Rabin-Karp implementation and have issues with the line bellow:

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

I understand that the following expression might give negative result:

txtHash - RM*txt.charAt(i-M)

First question:

if we always add Q, a large prime, can this result with negative number due to overflow ?
- If not, why not ? If yes, shouldn't this addition be done only if result is negative ?

Second question:

If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?

txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;

Third question, this part confuses me most:

Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;

I have read the answer i have linked and according to the answer by Aneesh, and if i understood correctly expressions bellow should be similar:

hash = hash - ((5 % p)*(10^2 %p) %p)

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

But i don't see why they are similar because in example with hash, % p is not calculated for previous hash value, however for txtHash we calculate % Q over the previous hash as well.

score 2 · Accepted Answer · answered Dec 11 '18 at 18:38

First question:

if we always add Q, a large prime, can this result with negative number due to overflow ? If not, why not ? If yes, shouldn't this addition be done only if result is negative ?

The prime number Q is usually chosen so that 2Q still does not overflow the type.

Now let's see.

txtHash is from 0 to Q - 1.
RM*txt.charAt(i-M) is large.
RM*txt.charAt(i-M) % Q is from 0 to Q - 1.
txtHash - RM*txt.charAt(i-M) % Q is from -(Q - 1) to Q - 1.
txtHash + Q - RM*txt.charAt(i-M) % Q is from 1 to 2Q - 1.

So, as long as 2Q - 1 does not overflow, the above expression is fine.

Second question:

If, for a moment, we didn't care about negative numbers, would it be correct to write expression bellow ?

txtHash = (txtHash - RM*txt.charAt(i-M)) % Q;

Yeah, if the % Q always gave a result from 0 to Q-1 (as it does in Python for example), the above expression would be fine.

Third question, this part confuses me most:

Lets assume that the overflow cannot happen when we add Q. Why is there left-most % Q operation over the leading digit ?

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q ) % Q;

Suppose we remove the leftmost % Q. Then let us estimate again:

txtHash is from 0 to Q - 1.
RM*txt.charAt(i-M) is large.
How large? From 0 to (Q - 1) * CharCode.
txtHash - RM*txt.charAt(i-M) is from -(Q - 1) * (CharCode - 1) to Q - 1.
txtHash + Q - RM*txt.charAt(i-M) is from -(Q - 1) * (CharCode - 2) to 2Q - 1.

Still possibly negative. Not what we wanted.

Great answer! Since q is generated with 31 bits: BigInteger.probablePrime(31, new Random()).longValue(), this should satisfy the Q condition — John, Dec 11 '18 at 18:46
If you don't mind, could you clarify a few more details? Shouldn't upper bound for the last two expression include charCode as well ? How did we get -2 for the charcode in last expresion ? — John, Dec 11 '18 at 19:04
@John we subtract the large number involving CharCode, so it contributes only to the lower bound. — Gassa, Dec 11 '18 at 19:41

Rolling hash overflow/negative result protection

1 Answers1