WHOLE PROCESS:
Your goal is to detect (and perhaps store information about) duplicate files.
1 First, you have to iterate through the directories and files,
see this:
list all files from directories and subdirectories in Java
2 then, for each file, load its contents into a byte array,
see this:
Reading a binary input stream into a single byte array in Java
3 then compute its MD5 (this is the part your project is about),
4 and store this information.
You can use a Set to detect duplicates (a Set only holds unique elements).
Set<String> files_hash; // each String is a string representation of MD5
if (files_hash.contains(my_md5)) // you know you have it already
or a
Map<String,String> file_and_hash; // each is file => hash
// to know whether a hash is already present you have to iterate over the values, or also keep a Set
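Steps 1 to 4 can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the class and method names (DuplicateFinder, md5Hex, findDuplicates) are made up for illustration, whole files are read into memory (fine for small files, not for huge ones), and the JDK's MessageDigest is used as a stand-in for the MD5 you are supposed to write yourself, so swap in your own function there.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of any library.
final class DuplicateFinder {

    // Step 3: hex string of the MD5 of the given bytes.
    // MessageDigest is a placeholder for your own MD5 implementation.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Steps 1, 2 and 4: walk the tree, read each regular file,
    // and group the paths by hash. Only groups with 2+ files are returned.
    static Map<String, List<Path>> findDuplicates(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, List<Path>> byHash = new HashMap<>();
        for (Path file : files) {
            byte[] content = Files.readAllBytes(file);
            byHash.computeIfAbsent(md5Hex(content), k -> new ArrayList<>()).add(file);
        }
        byHash.values().removeIf(group -> group.size() < 2);
        return byHash;
    }
}
```

Calling DuplicateFinder.findDuplicates(Paths.get("some/dir")) should give you one entry per hash that occurs more than once, with the list of files sharing it. A Map from hash to list of files is used here instead of a plain Set so that you also know which files are the duplicates.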
ANSWER for MD5:
Read the algorithm:
https://en.wikipedia.org/wiki/MD5
RFC: https://www.ietf.org/rfc/rfc1321.txt
Some googling also turns up
this step-by-step presentation:
http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf
or try to port an existing C (or Java) implementation.
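If you write MD5 yourself, the first step described in RFC 1321 is the padding, and it is easy to get wrong. Here is a small sketch of that step only (not the whole algorithm). The class and method names (Md5Padding, pad) are invented for this example, and it assumes the whole message fits in one byte array.

```java
// Hypothetical helper illustrating only the padding step of RFC 1321.
final class Md5Padding {

    // Returns the message followed by the MD5 padding:
    // one 0x80 byte, then zero bytes until the length is 56 mod 64,
    // then the original length in bits as a 64-bit little-endian value.
    static byte[] pad(byte[] message) {
        int len = message.length;
        int rem = len % 64;
        // Padding is always added, even if the length is already 56 mod 64.
        int padLen = (rem < 56) ? (56 - rem) : (120 - rem);
        byte[] out = new byte[len + padLen + 8];
        System.arraycopy(message, 0, out, 0, len);
        out[len] = (byte) 0x80;
        long bitLen = (long) len * 8;
        for (int i = 0; i < 8; i++) {
            // Low-order byte first.
            out[len + padLen + i] = (byte) (bitLen >>> (8 * i));
        }
        return out;
    }
}
```

The result always has a length that is a multiple of 64 bytes, so it can then be processed in 512-bit blocks as the RFC describes.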
OVERALL STRATEGY
To save time and make the process faster, you must also think about how your function will be used:
if you use it once, for one single file, it is better to reduce the work by first selecting the other files by size (only files of the same size can be duplicates).
if you use it regularly (and want it to be fast), scan new files regularly in the background to keep a hash database up to date. Detecting new files is straightforward.
if you want to find all duplicated files, it is better to scan everything, and use the Set strategy as well.
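The first case (one file, filter by size before hashing) could look something like this. Again just a sketch: SizePrefilter and sameSizeCandidates are names I made up, and you would still have to hash (or compare byte by byte) the candidates it returns.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of any library.
final class SizePrefilter {

    // Returns the regular files under root that have the same size as target
    // (excluding target itself). Only these can be duplicates of it,
    // so only these need to be hashed.
    static List<Path> sameSizeCandidates(Path target, Path root) throws IOException {
        long size = Files.size(target);
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> candidates = new ArrayList<>();
        for (Path file : files) {
            if (Files.size(file) == size && !Files.isSameFile(file, target)) {
                candidates.add(file);
            }
        }
        return candidates;
    }
}
```

Reading a file's size is much cheaper than reading its whole content, which is why this check is worth doing before any MD5 computation.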
Hope this helps