
Many deduplication libraries and applications apply the Rabin-Karp rolling hash algorithm as a fast way to decide where to cut chunks out of a file's binary content.
My question is: why is the Rabin-Karp algorithm so often used for cutting chunks?
I know it is a fast rolling hash algorithm, but my question is more fundamental.
There are many possible ways to cut a chunk.
For instance, comparing a single byte (without any mod operation) against a fixed value to decide where to cut would produce 256-byte chunks on average.
Comparing 9 bits would produce 512-byte chunks on average, and so on.
Wouldn't simply comparing the last few bits, without any hashing, give results similar to a rolling hash algorithm such as Rabin-Karp, but faster?
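
Here is a minimal sketch of the byte-comparison rule I mean, to make the question concrete; the cut value and the random test data are illustrative assumptions, not taken from any particular dedup system:

```python
import os

def chunk_by_byte(data: bytes, cut_value: int = 0x42) -> list[bytes]:
    """Cut a chunk whenever the current byte equals cut_value (no hashing)."""
    chunks = []
    start = 0
    for i, b in enumerate(data):
        if b == cut_value:               # boundary test on one raw byte
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

data = os.urandom(1 << 20)               # 1 MiB of random bytes
chunks = chunk_by_byte(data)
print(len(data) / len(chunks))           # ~256: each byte matches with probability 1/256
```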

John Doyle

1 Answer


For variable-sized chunking dedup, we have two steps:

  1. Chunking
  2. Indexing

The Rabin-Karp rolling hash is a chunking algorithm that cuts a file into variable-sized pieces. We then need to index and query the data chunks, since we are deduplicating: the general method is to calculate a hash value for each chunk and store the chunk under that hash as its key. With the Rabin-Karp algorithm this is straightforward, since we get a hash and the data chunk at the same time.
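
Here is a minimal sketch of rolling-hash chunking; the window size, base, modulus, and boundary mask are illustrative parameters, not the ones any particular dedup system uses:

```python
import os

WINDOW = 48            # bytes in the rolling window
BASE = 257             # polynomial base
MOD = (1 << 61) - 1    # large modulus
MASK = (1 << 13) - 1   # cut when the low 13 bits are zero -> ~8 KiB chunks on average

def rolling_chunk(data: bytes):
    pow_w = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD  # slide: drop the oldest byte
        h = (h * BASE + b) % MOD                      # slide: add the newest byte
        if i >= WINDOW and (h & MASK) == 0:           # boundary test on hash bits
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]                            # trailing partial chunk

chunks = list(rolling_chunk(os.urandom(1 << 22)))     # 4 MiB of random bytes
print(sum(map(len, chunks)) / len(chunks))            # ~8192 on average
```

The mask width sets the average chunk size: a k-bit mask fires with probability 2^-k per position, mirroring the arithmetic in your question.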

You mentioned that comparing the last few bits can cut a file into pieces, but if you want to index those chunks, how do you get the key? You still have to calculate a hash.
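
To make that concrete, here is a minimal sketch of the indexing step, reusing the chunk_by_byte function and data from the sketch in your question; SHA-256 as the key function is an illustrative choice, not something specified above:

```python
import hashlib

index = {}
for chunk in chunk_by_byte(data):            # boundaries were found without hashing...
    key = hashlib.sha256(chunk).hexdigest()  # ...but indexing needs a hash pass anyway
    index.setdefault(key, chunk)             # dedupe: keep one copy per unique chunk
print(len(index), "unique chunks")
```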

Hope this helps.

zhangyuting