I am trying to implement a data deduplication program in the cloud using Java.

I'm not sure how to proceed with the implementation.

First, I wanted to do a simple comparison of the file size, date, and name. However, this is ineffective, since a file might have the same content but a different name.

I have decided on a simple pipeline: file upload -> file chunking -> Rabin-Karp hashing -> determine whether the file needs to be uploaded.
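
Roughly, I imagine the chunking step with a Rabin-Karp style rolling hash would look something like the sketch below (just a sketch: the class name is a placeholder, and the window size, modulus, and chunk-boundary mask are arbitrary values I picked for illustration, not tuned constants):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RollingChunker {

    private static final int WINDOW = 48;            // sliding window size in bytes
    private static final long BASE = 257;            // polynomial base
    private static final long MOD = (1L << 31) - 1;  // large prime modulus
    private static final long MASK = (1 << 12) - 1;  // gives roughly 4 KiB average chunks

    /** Splits data into chunks whose boundaries depend only on the content. */
    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        long pow = 1;                                // BASE^(WINDOW-1) % MOD
        for (int i = 0; i < WINDOW - 1; i++) {
            pow = (pow * BASE) % MOD;
        }

        long hash = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            int in = data[i] & 0xFF;
            if (i - start < WINDOW) {
                hash = (hash * BASE + in) % MOD;     // still filling the window
            } else {
                int out = data[i - WINDOW] & 0xFF;   // byte sliding out of the window
                hash = ((hash - (out * pow) % MOD + MOD) % MOD * BASE + in) % MOD;
            }
            // Cut a chunk when the low bits of the rolling hash hit a fixed pattern.
            if (i - start + 1 >= WINDOW && (hash & MASK) == 0) {
                chunks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
                hash = 0;
            }
        }
        if (start < data.length) {
            chunks.add(Arrays.copyOfRange(data, start, data.length));
        }
        return chunks;
    }
}
```

Each chunk would then be hashed (e.g. with SHA-256) and compared against hashes already stored, so only chunks that are not already present need to be uploaded. (Real implementations also enforce minimum and maximum chunk sizes; the sketch only enforces a minimum equal to the window size.)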

Will this be fine or are there any improvements?

Where would I be able to find more information on this? I have tried looking around the Internet, but I can't find much; most of it is broken down into specific implementations without any explanation or details on file chunking or Rabin-Karp hashing.

I would also like to know which Java libraries I should look into for this program.

  • How big are the files? Do you explicitly need to chunk and hash the chunks? Why wouldn't a simple file hash be sufficient, e.g., a `DigestInputStream`? – Dave Newton Jun 05 '19 at 15:34
  • For now, I want to test with small file sizes, maybe comparing an uploaded PDF file against the hash values in the database in the cloud. That is why I am not sure whether it is necessary to do fingerprinting of the data. – Shadow King Jun 05 '19 at 16:31
  • What is hashing if not fingerprinting? It's really not clear what actual problem you have. – Dave Newton Jun 05 '19 at 16:49
  • I am just not sure how to implement data deduplication in cloud storage: should I break the file into smaller blocks and get the hash value of each block, or can I just get the hash value of the whole file and compare it with the existing hash values in the database? – Shadow King Jun 06 '19 at 07:57
  • Why aren't you sure? In other words, what problem(s) do you see with either solution? Pros/cons? A complete-file hash or even checksum is almost *always* sufficient--that's why they're used to verify file downloads. – Dave Newton Jun 06 '19 at 12:06
  • I am not sure because I am still trying to understand the data deduplication technique. So that means I do not necessarily have to break a PDF file of maybe 15 MB into smaller chunks? I can straight away hash the entire file and compare it with the hash values of the files already stored in the database? Which means I do not have to do any file chunking if the files to be uploaded are small, yes? – Shadow King Jun 06 '19 at 13:22
  • Correct, although whether you need to chunk at *all* is somewhat up in the air AFAICT. – Dave Newton Jun 06 '19 at 13:57
  • What Java libraries would you recommend studying or researching for this file-level hashing? – Shadow King Jun 06 '19 at 14:41
  • https://stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java etc. There are a million ways to do it; it doesn't really matter which you use. – Dave Newton Jun 06 '19 at 14:54
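
A minimal sketch of the whole-file hashing approach suggested in the comments above, using a `MessageDigest` wrapped in a `DigestInputStream` (SHA-256 here, though MD5 as in the linked question works the same way; the `FileHasher` class name is just a placeholder):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {

    /** Returns the SHA-256 digest of a file as a lowercase hex string. */
    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the DigestInputStream feeds the bytes into the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

File-level deduplication is then just a lookup: if the hex digest already exists in the database of uploaded files, the new file is a duplicate and does not need to be stored.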

1 Answer

It would be easier if you state your problem constraints. Assuming the following:

  • The smallest indivisible unit of data is a file
  • Files are reasonably small to fit in memory for computing hashes
  • Your files are in some cloud bucket or similar store where you can list them all; that also eliminates identical filenames.

You can probably narrow down your problem.

  1. Iterate through all the files using some fast hashing algorithm like a basic CRC checksum and build a map from checksum to files. (This can easily be parallelized.)
  2. Keep only the files whose checksums collide. The rest can be safely left out, and for all practical purposes that should be a pretty large portion of the data.
  3. Run through this remaining subset of files with a cryptographic hash (or, in the worst case, compare the entire files) and identify the matches.

This can be refined depending on the underlying data.

This is how I would approach the problem; given its structure, it can easily be partitioned and solved in parallel. Feel free to elaborate more so that we can reach a good solution.
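
For example, a minimal sketch of this two-pass approach could look like the following, assuming the files sit on a local filesystem and are small enough to read into memory (for a cloud bucket, the listing and reading calls would be replaced by the provider's SDK; the `Deduplicator` class name and directory argument are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

public class Deduplicator {

    /** Groups duplicate files: cheap CRC32 pass first, then SHA-256 to confirm collisions. */
    public static Map<String, List<Path>> findDuplicates(Path dir)
            throws IOException, NoSuchAlgorithmException {
        // Pass 1: group files by CRC32 checksum (fast, but collisions are possible).
        Map<Long, List<Path>> byCrc = new HashMap<>();
        try (Stream<Path> files = Files.walk(dir)) {
            for (Path file : files.filter(Files::isRegularFile).collect(Collectors.toList())) {
                byCrc.computeIfAbsent(crc32(file), k -> new ArrayList<>()).add(file);
            }
        }

        // Pass 2: only groups with more than one member can contain duplicates;
        // confirm them with a cryptographic hash.
        Map<String, List<Path>> duplicates = new HashMap<>();
        for (List<Path> group : byCrc.values()) {
            if (group.size() < 2) continue;
            for (Path file : group) {
                duplicates.computeIfAbsent(sha256(file), k -> new ArrayList<>()).add(file);
            }
        }
        duplicates.values().removeIf(g -> g.size() < 2);   // drop groups that were CRC false alarms
        return duplicates;
    }

    private static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file));
        return crc.getValue();
    }

    private static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```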

EFreak
  • So what I am trying to decide is whether to implement file-level deduplication or block-level deduplication. I am not sure whether just getting the hash value of the file and putting it in a hash map would be sufficient, or whether I need to break a file into several smaller blocks and deduplicate on the blocks. – Shadow King Jun 06 '19 at 07:56
  • What's the range of your file sizes - min, max, avg? What are your block sizes? Again, a lot depends on the requirements. If this is a one-time thing, I wouldn't bother. For a batch system which runs on a daily basis, the solution may be quite different. – EFreak Jun 06 '19 at 08:12
  • For now, I want to implement it on a small scale, like checking whether an uploaded PDF or text file is already duplicated in the cloud database. This implementation is just for a final year project; honestly, I am a bit unclear on how I should go about it. Some solutions suggest doing Rabin-Karp hashing and breaking the file into smaller blocks, while others just recommend checking against the existing hash values. I would say this is a one-time thing just to prove the concept of data deduplication in the cloud. – Shadow King Jun 06 '19 at 08:25
  • @ShadowKing I'm not sure what you mean by "existing hash value". The only reason I can think of for explicitly breaking it up is if the files are huge and computing the entire checksum/hash would be inefficient. Noting, of course, if you break it up, you have to store the hash/checksum of those chunks somewhere, and they'd need to be indexed. Even then it may not be entirely sufficient (hash collisions, while rare, do happen; e.g., if you find a matching hash, it's *likely* you have a duplicate file, but you'd probably want to check at least the next chunks' hashes). – Dave Newton Jun 06 '19 at 12:08
  • By existing hash value, I mean the hash values of other files in the database. I am just not sure what Java libraries I should study for this. I have seen some forums suggesting studying the Rabin-Karp algorithm, hash maps or hash tables, and `MessageDigest`. – Shadow King Jun 06 '19 at 13:24