2

I'm writing a program to find duplicates of files.

I have two folders in which I have to find duplicates. In the worst case I would have to compare every file with every other file. I was thinking of generating a checksum for each file, comparing the checksums, and then, if the checksums are equal, performing a byte-by-byte comparison to make sure the files really are identical.

The question is: which checksum algorithm is fast enough that computing checksums is worth the time, rather than just comparing the files byte by byte?
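For illustration, here is a rough Python sketch of the approach I have in mind (the function names and chunk size are just placeholders):

```python
import hashlib

def checksum(path, algo="md5", chunk_size=65536):
    """Hash a file in chunks so large files never have to fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bytes(path_a, path_b, chunk_size=65536):
    """Byte-by-byte comparison, used only to confirm a checksum match."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files ended at the same point
                return True
```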

Tarik
Saajuu
    In order to generate a checksum, you'll have to read the file byte-by-byte. And no, the filesystem [doesn't generate checksums](http://stackoverflow.com/questions/1490384/there-is-in-windows-file-systems-a-pre-computed-hash-for-each-file), so you'll have to do that yourself and cache the checksum for each file. – CodeCaster Oct 31 '13 at 12:35
  • I took the liberty of rephrasing the question which lacked clarity. – Tarik Oct 31 '13 at 12:59
  • Do you know anything at all about the sort of files you're likely to check? E.g. how big do you expect the files to be? – Rob Oct 31 '13 at 15:57

3 Answers

6

You can reduce the number of comparisons you have to make and also the amount of I/O by getting a full list of the files and then sorting by length. Two files can't be identical if they don't have identical lengths. So you can eliminate a large number of files without doing any I/O other than getting the directory information, which you have to get anyway.

If there are only two files with the same length, X, then you don't have to compute a checksum for those files. Just compare them directly.

If there are three or more files with the same length, then you're better off computing checksums for all of them, comparing the checksums, and only doing a byte-by-byte comparison when two checksums match.
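A minimal sketch of that strategy (Python; `checksum` and `same_bytes` stand in for whatever hashing and byte-comparison routines you already have):

```python
import os
from collections import defaultdict
from itertools import combinations

def group_by_size(paths):
    """Bucket candidate files by length; only equal-length files can match."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    return groups

def find_duplicates(paths, checksum, same_bytes):
    """checksum and same_bytes are assumed helpers (e.g. an MD5 hash and a
    byte-by-byte comparison)."""
    duplicates = []
    for size, group in group_by_size(paths).items():
        if len(group) < 2:
            continue                      # unique length, cannot be a duplicate
        if len(group) == 2:
            # Exactly two candidates: skip the checksum, compare directly.
            if same_bytes(group[0], group[1]):
                duplicates.append((group[0], group[1]))
        else:
            # Three or more candidates: checksums narrow down the pairs to verify.
            sums = {path: checksum(path) for path in group}
            for a, b in combinations(group, 2):
                if sums[a] == sums[b] and same_bytes(a, b):
                    duplicates.append((a, b))
    return duplicates
```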

Jim Mischel
  • By the time I fixed my answer I found out that you had come up with the file size comparison first. +1. – Tarik Oct 31 '13 at 18:27
2

First of all, group the files by length, as Jim Mischel says.

If the files to compare are large, it might be quicker to calculate your representative (which is all a checksum is) from just the first n bytes of each file. Reading the entirety of a large file to compute a checksum, only to compare it with another file that differs in the first n bytes, is inefficient. In theory, the first n bytes determine the file as uniquely as an n-byte checksum does (this holds if all possible files of a given length are equally likely).

Of course, if the files to compare are small, reading the whole file is as quick as reading just a subset of it.
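A minimal sketch of the prefix idea (Python; the prefix length n is arbitrary):

```python
from collections import defaultdict

def prefix_key(path, n=4096):
    """Read only the first n bytes as a cheap representative of the file."""
    with open(path, "rb") as f:
        return f.read(n)

def group_by_prefix(paths, n=4096):
    """Files whose prefixes differ cannot be identical; files in the same
    bucket still need a full comparison before being declared duplicates."""
    groups = defaultdict(list)
    for path in paths:
        groups[prefix_key(path, n)].append(path)
    return groups
```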

Rob
0

Any checksum algorithm will do; MD5, for example, or CRC32. You will hardly waste any time, since I/O is far slower than the CPU time spent calculating the checksum.

You said: "I have two folders, in which I have to find duplicates." I would like to clarify something here: if the goal is to find duplicate files, it does not matter whether the files are spread over one, two, or any number of folders. Assuming you have n files, you need on the order of n log n comparisons to find the duplicates. It is indeed useful to read the n files once, calculate their checksums, and then sort the checksums in n log n time to locate the duplicates. Note, however, that you can avoid much of that work by first comparing file sizes and only resorting to checksums when three or more files share the same size. That would greatly accelerate your search for duplicates.
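As a rough illustration (Python with hashlib's MD5; the function names are just placeholders): checksum each file once, sort the pairs, and scan adjacent entries for matches.

```python
import hashlib
from itertools import groupby

def md5sum(path, chunk_size=65536):
    """MD5 of a file, read in chunks; the I/O usually dominates the hashing cost."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def candidate_duplicates(paths):
    """Checksum every file once, sort the (checksum, path) pairs (n log n),
    and collect runs of equal checksums; folder boundaries are irrelevant."""
    keyed = sorted((md5sum(p), p) for p in paths)
    groups = []
    for _, run in groupby(keyed, key=lambda kp: kp[0]):
        files = [p for _, p in run]
        if len(files) > 1:
            groups.append(files)  # candidates; a byte-by-byte check confirms them
    return groups
```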

Tarik