TL;DR: How to identify duplicate, non-overlapping 1 KB chunks in a file that can be large and can be a binary file?
I recently came across this question in one of the challenges.
We are given a file name. The file's size will be a multiple of 1 KB. We have to perform a dedup operation on this file and write the modified content to another file. The dedup operation finds and removes the duplicate, non-overlapping 1 KB chunks from the file. The file can be very large, and it can also be a binary file.
The second part of the question involved reversing the dedup operation and regenerating the original file from the deduped file.


My approach: I tried to use hashing, as suggested in this blog by Adam Horwath. I calculated the hash of each 1 KB chunk of data and stored it in a hashtable, with the hash as key and the index of the chunk as value. Here is my code to calculate the hash for 1 KB of data (similar to inithash in the blog):

// Polynomial hash as used in the Rabin-Karp algorithm:
// hash = sum of hconst^n * data[x]
static final long hconst = 69069; // good multiplier for mod 2^32

public static long calculateHash(int[] data, int chunkSize) {
    long hash = 0;
    for (int i = 0; i < chunkSize; i++) {
        int c = data[i];
        hash *= hconst; // multiply by the constant
        hash += c;      // add the byte to the hash
    }
    return hash;
}

There is (obviously) something wrong with my understanding or implementation, because it didn't give the correct result. My questions are:

  • Is the hashing approach correct for identifying duplicate chunks? (Comparing every byte is a costly process.)
  • Is there any better way to identify duplicate chunks?
shane
  • Use a proven cryptographic hash. Not a custom method returning a long. And all you need is a Set to know if a hash has already been met. Not a Map. – JB Nizet Jan 16 '18 at 07:50
  • The problem is underspecified. Are you supposed to compare only blocks starting at 1k boundaries? What if a repeated block starts at offset 100 within the file, and a similar block appears later in the file at, say, offset 2200 (not an even multiple of 1024 away)? – Jim Garrison Jan 16 '18 at 07:55
  • @JBNizet I was using a hashtable so that while reversing the dedup operation I would know which chunk the current chunk is a duplicate of. – shane Jan 16 '18 at 09:52
  • @JimGarrison The key here is to consider non-overlapping blocks of size 1 KB. E.g., if the file is 5 KB, then there are 5 chunks of 1 KB each; if the hashes of the 5 chunks are 1,2,2,1,3, this means there are 3 unique chunks and 2 duplicate chunks. We are supposed to compare blocks at 1 KB boundaries only. – shane Jan 16 '18 at 10:06
  • @JBNizet, can you elaborate more on 'proven cryptographic hash' and how we can implement one? – shane Jan 16 '18 at 12:33
  • Remember - having a duplicate hash does not mean the data blocks are duplicates. There are a lot more available values for a 1024-byte data block than there are for an 8-byte hash - there **must** be duplicate hashes. – Andrew Henle Jan 16 '18 at 13:23
  • A proven cryptographic hash is a hash that has been proven to be effective and secure for cryptographic operations (like a cryptographic signature). To be effective, such an algorithm must, among other things, make it virtually impossible to come up with two different inputs that generate the same hash. Which is just what you want. Look into the MessageDigest class, and use an algorithm such as SHA256. – JB Nizet Jan 16 '18 at 14:45
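For concreteness, here is a minimal sketch of the MessageDigest approach suggested in the comments above, assuming the 1 KB chunk size from the question; the class and method names are purely illustrative:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChunkHasher {
    // Hash one 1 KB chunk with SHA-256 via MessageDigest, as suggested above.
    // Returns a hex string that can be used as a key in a Set or Map.
    public static String hashChunk(byte[] chunk) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(chunk);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}

Pairing such a hash with a Map from hash to first-seen chunk index gives both the duplicate check and the bookkeeping needed to reverse the dedup later, as discussed in the comments.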

1 Answer

Is there a better way than an in-core hash table? Yes. Especially if input files are larger than RAM.

You explained that you have a very large number of 1 KiB segments to process. Pre-process them by reading each segment and writing one line per segment to a temporary segments.txt file containing two columns. The first column holds a copy of the segment contents, or the SHA-224 hash of the contents. The second column holds the segment index number, a serial number starting with zero. Feel free to use just the first few bytes of the hash, depending on your sensitivity to hash collisions.
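A rough sketch of that pre-processing pass, in Java as in the question (the SHA-224 digest, file name, and two-column layout follow this answer; the class and method names are made up for illustration):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.security.MessageDigest;

public class SegmentPreprocessor {
    public static void writeSegments(String inputFile) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-224");
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(inputFile)));
             PrintWriter out = new PrintWriter(new FileWriter("segments.txt"))) {
            byte[] chunk = new byte[1024];
            long index = 0;                       // serial number starting with zero
            while (true) {
                try {
                    in.readFully(chunk);          // file size is a multiple of 1 KiB
                } catch (EOFException end) {
                    break;
                }
                byte[] digest = md.digest(chunk); // digest() also resets the MessageDigest
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));
                }
                out.println(hex + " " + index);   // column 1: hash, column 2: segment index
                index++;
            }
        }
    }
}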

Now use /usr/bin/sort (an out-of-core mergesort) to create segments_sorted.txt. At this point your problem is trivial: simply read each line, remembering the previous hash. If cur_hash == prev_hash, you have identified a duplicate chunk. The associated index lets you rapidly seek() to the original contents in case ruling out potential collisions is important in your application.
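After sorting (for example, sort segments.txt > segments_sorted.txt), the duplicate scan could look roughly like the following sketch, again with illustrative names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DuplicateScanner {
    public static void scan() throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("segments_sorted.txt"))) {
            String prevHash = null;
            String prevIndex = null;
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(" ");
                String curHash = cols[0];
                String curIndex = cols[1];
                if (curHash.equals(prevHash)) {
                    // Same hash as the previous sorted line: a duplicate candidate.
                    // The indices let you seek() back into the original file and
                    // compare the raw bytes if collisions must be ruled out.
                    System.out.println("chunk " + curIndex + " duplicates chunk " + prevIndex);
                }
                prevHash = curHash;
                prevIndex = curIndex;
            }
        }
    }
}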

J_H