TL;DR: How do I identify duplicate, non-overlapping 1 KB chunks in a file that may be very large and may be binary?
I recently came across this problem in a coding challenge.
We are given a file name. The file's size is a multiple of 1 KB. We have to perform a dedup operation on this file and write the modified content to another file. The dedup operation finds and removes duplicate, non-overlapping 1 KB chunks from the file. The file can be very large, and it can also be a binary file.
The second part of the question involved reversing the dedup operation and regenerating the original file from the deduped file.
My approach: I tried to use hashing, as suggested in this blog by Adam Horwath. I calculated a hash of each 1 KB chunk of data and stored it in a hashtable, with the hash as key and the index of the chunk as value (the surrounding dedup loop is sketched after the code below). Here is my code to calculate the hash for 1 KB of data (similar to inithash in the blog):
// Hashing in the style of the Rabin-Karp algorithm:
// hash = sum of hconst^(chunkSize-1-i) * data[i]
private static final long hconst = 69069L; // good multiplier for mod 2^32

public static long calculateHash(int[] data, int chunkSize) {
    long hash = 1;
    for (int i = 0; i < chunkSize; i++) {
        int c = data[i];   // value of the current byte
        hash *= hconst;    // multiply the running hash by the constant
        hash += c;         // add the byte to the hash
    }
    return hash;
}
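To show what I mean by the hashtable part, here is a minimal sketch of the dedup loop I had in mind (simplified, not my actual code; the class and helper names are made up, and I added a byte-by-byte comparison on hash matches because I suspect the hash alone is not enough):

import java.io.*;
import java.util.*;

public class DedupSketch {
    static final int CHUNK = 1024;
    static final long HCONST = 69069L;

    // Same polynomial hash as calculateHash above, applied directly to bytes
    static long chunkHash(byte[] data) {
        long hash = 1;
        for (int i = 0; i < CHUNK; i++) {
            hash = hash * HCONST + (data[i] & 0xFF); // mask so bytes are treated as unsigned values
        }
        return hash;
    }

    public static void dedup(String inFile, String outFile) throws IOException {
        // hash -> all distinct chunks seen with that hash (needed to rule out collisions)
        Map<Long, List<byte[]>> seen = new HashMap<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(inFile));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile))) {
            byte[] chunk = new byte[CHUNK];
            // readNBytes (Java 9+) fills the buffer or returns less only at end of file;
            // the file size is a multiple of 1 KB, so a short read means we are done
            while (in.readNBytes(chunk, 0, CHUNK) == CHUNK) {
                long h = chunkHash(chunk);
                List<byte[]> candidates = seen.computeIfAbsent(h, k -> new ArrayList<>());
                boolean duplicate = false;
                for (byte[] prev : candidates) {
                    if (Arrays.equals(prev, chunk)) { duplicate = true; break; } // verify bytes, don't trust the hash alone
                }
                if (!duplicate) {
                    candidates.add(chunk.clone()); // keeps unique chunks in memory; a real version might store offsets and re-read instead
                    out.write(chunk, 0, CHUNK);
                }
            }
        }
    }
}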
There is obviously something wrong with my understanding or my implementation, because it did not give the correct result. My questions are:
- Is a hashing approach the right way to identify duplicate chunks? (Comparing every byte of every pair of chunks seems costly.)
- Is there a better way to identify duplicate chunks, for example something along the lines of the sketch below?
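Concretely, would it be acceptable to hash each 1 KB chunk with a collision-resistant hash such as SHA-256 and treat a hash match as a duplicate, skipping the byte comparison on the assumption that a SHA-256 collision is practically impossible? A rough sketch of what I mean (class and method names are my own, not from the challenge):

import java.io.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class Sha256Dedup {
    static final int CHUNK = 1024;

    public static void dedup(String inFile, String outFile) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> seen = new HashSet<>(); // Base64-encoded SHA-256 of every chunk written so far
        try (InputStream in = new BufferedInputStream(new FileInputStream(inFile));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile))) {
            byte[] chunk = new byte[CHUNK];
            while (in.readNBytes(chunk, 0, CHUNK) == CHUNK) {   // file size is a multiple of 1 KB
                String key = Base64.getEncoder().encodeToString(md.digest(chunk)); // digest() also resets md
                if (seen.add(key)) {            // true only the first time this hash is seen
                    out.write(chunk, 0, CHUNK); // write the chunk only if it is not a duplicate
                }
            }
        }
    }
}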