I am trying to implement a data deduplication program in the cloud using Java.
I'm not sure how to proceed with the implementation.
At first I wanted to do a simple comparison of file size, date, and name. However, this is ineffective, since two files can have the same content but different names (or the same name but different content).
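As a baseline, I understand I could compare by a digest of the content instead of metadata. A minimal sketch using only the JDK's `MessageDigest` (note `HexFormat` needs Java 17; for large files, streaming through a `DigestInputStream` would be better than `readAllBytes`):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ContentHash {
    // Identical content gives an identical ID, regardless of the
    // file's name, date, or location.
    public static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file)); // OK for small files
        return HexFormat.of().formatHex(digest);
    }
}
```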
I have settled on a simple pipeline: file upload -> file chunking -> Rabin-Karp hashing -> decide whether the file (or chunk) needs to be uploaded, as sketched below.
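To check my understanding of the middle two steps, here is a sketch of content-defined chunking with a Rabin-Karp-style rolling hash: the hash of a sliding window is updated in O(1) per byte (drop the oldest byte, add the newest), and a chunk boundary is declared whenever the low bits of the hash match a fixed pattern, so boundaries depend on content rather than file offsets. All constants (window size, chunk bounds, mask) are illustrative, not tuned, and the cloud-side chunk index is faked with an in-memory `HashSet`:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.List;
import java.util.Set;

public class RollingChunker {
    static final int WINDOW = 48;            // sliding-window length in bytes
    static final long BASE = 257;            // polynomial base
    static final long MOD = 1_000_000_007L;  // prime modulus (keeps the math in a long)
    static final long MASK = (1 << 13) - 1;  // ~8 KiB average chunk size
    static final int MIN_CHUNK = 2 * 1024;
    static final int MAX_CHUNK = 64 * 1024;
    // BASE^(WINDOW-1) mod MOD, used to drop the oldest byte from the hash
    static final long POW = modPow(BASE, WINDOW - 1, MOD);

    /** Splits data into chunks whose boundaries depend only on the content. */
    static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        long hash = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (i - start >= WINDOW) {        // window is full: drop oldest byte
                long out = data[i - WINDOW] & 0xFF;
                hash = Math.floorMod(hash - out * POW % MOD, MOD);
            }
            hash = (hash * BASE + (data[i] & 0xFF)) % MOD;  // add newest byte
            int len = i - start + 1;
            // Cut when the hash's low bits match the mask (content-defined),
            // or force a cut so no chunk exceeds MAX_CHUNK.
            if ((len >= MIN_CHUNK && (hash & MASK) == MASK) || len >= MAX_CHUNK) {
                chunks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
                hash = 0;
            }
        }
        if (start < data.length) {            // trailing partial chunk
            chunks.add(Arrays.copyOfRange(data, start, data.length));
        }
        return chunks;
    }

    static long modPow(long base, long exp, long mod) {
        long result = 1, b = base % mod;
        for (; exp > 0; exp >>= 1) {
            if ((exp & 1) == 1) result = result * b % mod;
            b = b * b % mod;
        }
        return result;
    }

    // Dedup decision: upload a chunk only if its fingerprint is unseen.
    public static void main(String[] args) throws Exception {
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Path.of(args[0]));
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        Set<String> seen = new HashSet<>();   // stand-in for the cloud-side chunk index
        for (byte[] c : chunk(data)) {
            String id = HexFormat.of().formatHex(sha.digest(c));
            if (seen.add(id)) {
                System.out.println("upload chunk " + id + " (" + c.length + " bytes)");
            } // else: duplicate chunk, skip the upload
        }
    }
}
```

My understanding is that the point of content-defined boundaries is that an insertion near the start of a file only changes the chunks around the edit, whereas fixed-size chunks would all shift and nothing downstream would deduplicate.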
Will this approach work, or are there improvements I should consider?
Where can I find more information on this? I have searched the Internet, but most of what I found is specific implementations with little explanation of the details of file chunking or Rabin-Karp hashing.
I would also like to know which Java libraries I should look into for this program.