How to find out if two binary files are exactly the same

Question

I have a repository where I store all my image files. I know that many of the images are duplicates, and I want to delete every redundant copy.

My idea is to generate a checksum for each image file and rename the file to its checksum, so that duplicates can be found easily by examining the filenames. The problem is that I am not sure which checksum algorithm to use. For example, if I generate the checksums using MD5, can I trust that identical checksums mean the files are exactly the same?

score 1 · Answer 1 · edited May 23 '17 at 12:17

Judging from the response to a similar question on the security forum (https://security.stackexchange.com/a/3145), you would expect roughly one accidental collision per 2^64 messages. If your files are different and your collection is nowhere near that size, MD5 can be used safely.

Also see the responses to a very similar question: "How many random elements before MD5 produces collisions?"
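To put a number on "not huge", here is a rough sketch of the birthday-bound estimate. It assumes hash values behave like uniformly random 128-bit numbers, so it says nothing about deliberately crafted collisions; the function name `collision_probability` is only illustrative.

```python
import math

def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files random inputs
    share the same hash value (birthday-bound approximation)."""
    space = 2.0 ** hash_bits
    # p ~= 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps precision for tiny p
    return -math.expm1(-num_files * (num_files - 1) / (2.0 * space))

# Compare a realistic image collection with the 2^64 figure quoted above.
for n in (10**6, 10**9, 2**64):
    print(f"{n:.3e} files -> p = {collision_probability(n, 128):.3e}")
```

Under this approximation, the chance of an accidental MD5 collision stays negligible for a collection of millions or even billions of files, and only becomes substantial near 2^64 files.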

score 1 · Answer 2 · answered Feb 08 '13 at 07:59

The chance of two different files having the same checksum is extremely slim, but uniqueness can never be absolutely guaranteed (pigeonhole principle). As an indication of how slim: Git uses SHA-1 checksums for software development source code, including Linux, and this has never caused any known problems, so I would say that you are safe. I would use SHA-1 instead of MD5 because it is slightly better if you are really paranoid.

answered by neelsg

"it is slightly better if you are really paranoid" you described my attitude great :) Thanks :) – Ugur KAYA Feb 08 '13 at 08:10
SHA-1 is 160 bits while MD5 is 128 bits. Therefore SHA-1 results are less likely to collide, but the computation takes slightly longer. If you want an even longer hash, you can use something like SHA-256, which will be even slower to compute. – neelsg Feb 08 '13 at 08:17
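For reference, the digest lengths mentioned in this comment can be checked with Python's standard `hashlib` module. This is only a small sketch, and the helper name `digest_bits` is made up for illustration.

```python
import hashlib

def digest_bits(algorithm: str) -> int:
    """Return the digest length in bits for a hashlib algorithm name."""
    return hashlib.new(algorithm).digest_size * 8

for name in ("md5", "sha1", "sha256"):
    print(f"{name}: {digest_bits(name)} bits")
# prints:
# md5: 128 bits
# sha1: 160 bits
# sha256: 256 bits
```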

score 1 · Answer 3 · answered Oct 22 '14 at 06:47

To be really sure, it is best to follow a two-step procedure. First, calculate a checksum for every file. If the checksums differ, you can be sure the files are not identical. If you find files with equal checksums, there is no way around a bit-by-bit comparison to be 100% sure that they are really identical. This holds regardless of the hashing algorithm used.

What you gain is a massive time saving: a bit-by-bit comparison of every possible pair of files would take forever and a day, while comparing a handful of candidates is fairly easy.
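The two-step procedure could look roughly like the sketch below. It is one possible way to do it, not a definitive implementation: the names `file_digest` and `find_duplicates` are made up, SHA-1 is an arbitrary default, and the code assumes Python 3.9+ and that all files live under a single directory tree.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path: Path, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never loaded whole."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root that are byte-for-byte identical."""
    # Step 1: bucket files by checksum; different checksums mean different files.
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)

    # Step 2: within each bucket, confirm equality by comparing file contents.
    groups = []
    for candidates in by_digest.values():
        remaining = candidates
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups

# Example use (the directory name is hypothetical):
# for group in find_duplicates(Path("images")):
#     print("identical:", ", ".join(str(p) for p in group))
```

Passing `shallow=False` to `filecmp.cmp` makes it compare the actual file contents instead of relying only on size and modification time, which is what the second step requires.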
