Scan duplicate document with md5

Question

For some reason I can't use MessageDigest.getInstance("MD5"), so I have to write the algorithm by hand. My project scans for duplicate documents (*.doc, *.txt, *.pdf) on an Android device. My question is: what do I need to write before the algorithm itself in order to scan for duplicate documents in the root directory of the Android device? The user should not have to select a directory: when I press the scan button, the process begins and the list view is shown. Can anyone help me? My project deadline is approaching. Thank you so much.

public class MD5 {

// What must I write here to scan for duplicate documents in the Android root directory using an MD5 hash?

//MD5 MANUAL ALGORITHM CODE
}

can you share *why* you think you cannot use `MessageDigest` ? — Jan, Dec 17 '15 at 11:41
You cant use MessageDigest, this implementation, or MD5 ? There are other md, like SHA-1, SHAxxx , or even simpler (with more collisions, but perhaps sufficient to detect duplicate documents ...) — guillaume girod-vitouchkina, Dec 17 '15 at 11:45
What do you get when you run `java.security.MessageDigest.getInstance("MD5")`? — Eduardo Santana, Dec 17 '15 at 11:46
If you are not up against a malevolent adversary, there are functions _much_ easier to code and compute with _less_ collisions than MD5 (I'm fond of [Fletcher](https://en.wikipedia.org/wiki/Fletcher%27s_checksum) - use 256 bits if the collision rate of MD5 was barely acceptable). — greybeard, Dec 18 '15 at 08:42
@greybeard, Thanks for the useful information, I would learn it after this project. — UserNG, Dec 18 '15 at 08:56

score 0 · Accepted Answer · edited May 23 '17 at 12:23


WHOLE PROCESS:

Your goal is to detect (and perhaps store information about) duplicate files.

1 First, iterate through the directories and files.

See:

list all files from directories and subdirectories in Java

2 For each file, load its contents into a byte array.

See:

Reading a binary input stream into a single byte array in Java

3 Then compute the MD5 of those bytes (this is the part your project implements).

4 Finally, store the resulting hash.

You can use a Set to detect duplicates (a Set holds only unique elements).

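Set<String> files_hash = new HashSet<>(); // each String is the hex representation of an MD5 hash
if (files_hash.contains(my_md5)) { /* you already have this content: it is a duplicate */ }
else { files_hash.add(my_md5); } // first time this hash is seen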

or a

Map<String, String> file_and_hash = new HashMap<>(); // each entry maps file path => hash
// to check whether a hash is already present, call containsValue (which iterates over the map) or also keep a Set
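Steps 1 to 4 above could be wired together roughly as follows. This is only a sketch under some assumptions: the class name `DuplicateScanner`, the `Hasher` interface and the extension filter are made up for illustration, and `Hasher` is just a hook where the hand-written MD5 would be plugged in. It reads each file fully into memory, which is simple but may be too costly for large PDFs.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DuplicateScanner {
    /** Hypothetical hook: plug the hand-written MD5 (or any other hash) in here. */
    public interface Hasher { String hash(byte[] content); }

    /** Step 1: recursively collect document files below dir. */
    static void collect(File dir, List<File> out) {
        File[] children = dir.listFiles(); // null if dir is unreadable or not a directory
        if (children == null) return;
        for (File f : children) {
            if (f.isDirectory()) collect(f, out);
            else if (isDocument(f)) out.add(f);
        }
    }

    /** Assumed filter: only the extensions named in the question. */
    static boolean isDocument(File f) {
        String n = f.getName().toLowerCase(Locale.ROOT);
        return n.endsWith(".doc") || n.endsWith(".txt") || n.endsWith(".pdf");
    }

    /** Step 2: load the whole file into a byte array. */
    static byte[] readAll(File f) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        }
    }

    /** Steps 3 and 4: hash each file and report those whose hash was already seen. */
    public static List<File> findDuplicates(File root, Hasher hasher) throws IOException {
        List<File> files = new ArrayList<>();
        collect(root, files);
        Set<String> seen = new HashSet<>();
        List<File> duplicates = new ArrayList<>();
        for (File f : files) {
            // Set.add returns false when the hash is already present
            if (!seen.add(hasher.hash(readAll(f)))) duplicates.add(f);
        }
        return duplicates;
    }
}
```

On Android the root passed in might be something like `Environment.getExternalStorageDirectory()`, and the app would need permission to read storage; both are assumptions about the setup, not something the question confirms. The returned list is what a ListView adapter could then display.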

ANSWER for MD5:

Read the algorithm description: https://en.wikipedia.org/wiki/MD5

RFC: https://www.ietf.org/rfc/rfc1321.txt

A little searching also turns up this step-by-step presentation: http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf

Alternatively, try to port an existing C (or Java) implementation.
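For orientation, here is one way a hand-written MD5 might look in Java, following the padding and round structure described in RFC 1321. Treat it as a sketch rather than a reference implementation: the class name `ManualMd5` and method name `hex` are invented here, and the whole message is held in memory, so a streaming version would be preferable for large files.

```java
final class ManualMd5 {
    // Per-step left-rotation amounts (RFC 1321).
    private static final int[] S = {
        7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
        5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
        4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
        6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21
    };
    // Constants K[i] = floor(2^32 * |sin(i + 1)|).
    private static final int[] K = new int[64];
    static {
        for (int i = 0; i < 64; i++) {
            K[i] = (int) (long) Math.floor(Math.abs(Math.sin(i + 1)) * 4294967296.0);
        }
    }

    /** Returns the MD5 of message as 32 lowercase hex characters. */
    public static String hex(byte[] message) {
        // Padding: 0x80, zeros up to 56 mod 64, then the bit length as 64-bit little-endian.
        long bitLen = (long) message.length * 8;
        int paddedLen = ((message.length + 8) / 64 + 1) * 64;
        byte[] buf = new byte[paddedLen];
        System.arraycopy(message, 0, buf, 0, message.length);
        buf[message.length] = (byte) 0x80;
        for (int i = 0; i < 8; i++) {
            buf[paddedLen - 8 + i] = (byte) (bitLen >>> (8 * i));
        }

        int a0 = 0x67452301, b0 = 0xefcdab89, c0 = 0x98badcfe, d0 = 0x10325476;
        int[] m = new int[16];
        for (int off = 0; off < paddedLen; off += 64) {
            // Decode the 64-byte block into sixteen little-endian 32-bit words.
            for (int i = 0; i < 16; i++) {
                int p = off + 4 * i;
                m[i] = (buf[p] & 0xff) | ((buf[p + 1] & 0xff) << 8)
                     | ((buf[p + 2] & 0xff) << 16) | ((buf[p + 3] & 0xff) << 24);
            }
            int a = a0, b = b0, c = c0, d = d0;
            for (int i = 0; i < 64; i++) {
                int f, g;
                if (i < 16)      { f = (b & c) | (~b & d); g = i; }
                else if (i < 32) { f = (d & b) | (~d & c); g = (5 * i + 1) % 16; }
                else if (i < 48) { f = b ^ c ^ d;          g = (3 * i + 5) % 16; }
                else             { f = c ^ (b | ~d);       g = (7 * i) % 16; }
                int tmp = d;
                d = c;
                c = b;
                b = b + Integer.rotateLeft(a + f + K[i] + m[g], S[i]);
                a = tmp;
            }
            a0 += a; b0 += b; c0 += c; d0 += d;
        }

        // Output the four state words in little-endian byte order.
        StringBuilder sb = new StringBuilder(32);
        for (int word : new int[] {a0, b0, c0, d0}) {
            for (int i = 0; i < 4; i++) {
                sb.append(String.format("%02x", (word >>> (8 * i)) & 0xff));
            }
        }
        return sb.toString();
    }
}
```

A quick sanity check is the RFC 1321 test suite: the empty input should give `d41d8cd98f00b204e9800998ecf8427e` and `"abc"` should give `900150983cd24fb0d6963f7d28e17f72`.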

OVERALL STRATEGY

To save time and make the process faster, also think about how your function will be used:

If you use it once, for a single file, reduce the work by first selecting only the other files that have the same size.
If you use it regularly (and want it to be fast), scan new files in the background at regular intervals to keep a hash database up to date. Detecting new files is straightforward.
If you want to find all duplicated files, it is better to scan everything, again using the Set strategy.
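The size-based preselection mentioned above can be sketched as a grouping step that runs before any hashing. The names `SizeFilter` and `sameSizeCandidates` are hypothetical; the idea is simply that a file whose length is unique cannot have a duplicate, so it never needs to be read or hashed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SizeFilter {
    /** Returns only the files that share their length with at least one other file. */
    public static List<File> sameSizeCandidates(List<File> files) {
        // Group files by length in bytes.
        Map<Long, List<File>> bySize = new HashMap<>();
        for (File f : files) {
            Long size = f.length();
            List<File> bucket = bySize.get(size);
            if (bucket == null) {
                bucket = new ArrayList<>();
                bySize.put(size, bucket);
            }
            bucket.add(f);
        }
        // Keep only groups with two or more members; the rest cannot be duplicates.
        List<File> candidates = new ArrayList<>();
        for (List<File> bucket : bySize.values()) {
            if (bucket.size() > 1) candidates.addAll(bucket);
        }
        return candidates;
    }
}
```

Only the returned candidates would then go through the (much more expensive) read-and-hash step.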

Hope this helps


answered Dec 17 '15 at 11:53

guillaume girod-vitouchkina


Thank you for your advice and the useful links you've shared. I was looking for the algorithm for a week and found some useful websites too. I understand the algorithm; what I don't understand is what I must write before the algorithm in order to scan for duplicate documents. – UserNG Dec 17 '15 at 12:24

It's not very smart to compute the MD5 for all files. The first thing you should do is get a list of all files, sort by size, and remove from the list any files that don't have the same size as some other file. Because two files can't be identical if they're different sizes. That greatly reduces the number of files you have to process. *Then* do the MD5 comparisons, but only compare files that have the same lengths. – Jim Mischel Dec 17 '15 at 21:02
@JimMischel, thanks for the smart remark; I have appended some strategy notes to my answer. – guillaume girod-vitouchkina Dec 18 '15 at 07:30
@All, Nice, thank you so much, I will try in my project :) – UserNG Dec 18 '15 at 08:29
What should I do next to get duplicate images once I have the Set files_hash? After that I am unable to proceed further, please help @guillaumegirod-vitouchkina – Mohd Saquib Dec 20 '17 at 13:41
Same images don't return same md5 value. – Mohd Saquib Dec 21 '17 at 11:55

score 0 · Answer 2 · edited May 23 '17 at 12:15

You'll want to recursively scan for files, then, for each file found, calculate its MD5 or whatever and store that hash value, either in a Set<...> if you only want to know if a file is a dupe, or in a Map<..., File> if you want to be able to tell which file the current file is a duplicate of.

For each file's hash, you look into the collection of already known hashes to check if that particular hash value is in it; if it is, you (most likely) have a duplicate file; if it is not, you add the new hash value to the collection and proceed with the next file.
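The Map variant described above might look like the following sketch, which remembers the first file seen for each hash so that a later file can be reported together with the one it duplicates. `DuplicateIndex` and `register` are illustrative names, and the hash string is assumed to come from whatever hash function you have implemented.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

final class DuplicateIndex {
    // hash value => first file that produced it
    private final Map<String, File> firstSeen = new HashMap<>();

    /**
     * Records file under hash. Returns the earlier file with the same hash,
     * or null if this hash has not been seen before.
     */
    public File register(String hash, File file) {
        File original = firstSeen.get(hash);
        if (original == null) {
            firstSeen.put(hash, file);
        }
        return original;
    }
}
```

Because equal hashes only make a duplicate "most likely", a non-null result could be followed by a byte-for-byte comparison of the two files if certainty matters.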
