How to find all files with the same content?

Question

This is an interview question: "Given a directory with lots of files, find the files that have the same content". I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does it make sense ?

The next question is how to choose the hash function. Would you use SHA-1 for that purpose ?

score 8 · Answer 1 · answered Nov 08 '10 at 13:57

8

I'd rather use the hash as a second step. Sorting the dir by file size first and hashing and comparing only when there are duplicate sizes may improve a lot your search universe in the general case.

answered Nov 08 '10 at 13:57

Dr. belisarius

60,527
15
115
190

Very nice "outside the hash" answer. – Nov 08 '10 at 21:27

Zack Bloom · Accepted Answer · 2010-11-08T21:22:57.597

Like most interview questions, it's more meant to spark a conversation than to have a single answer.

If there are very few files, it may be faster to simply to a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to shift around the disk reading in chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file, as you progress through the files eliminating potentials. hIt may also be necessary to distribute the problem among multiple servers, if their are enough files.

I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler 32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, than retest any files which match. This decision also depends on the relation between IO bandwidth and CPU power, if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests, if you have faster IO, the rereads may be cheaper than doing expensive hashes unnecessarily.

Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the files size, computer's speed, and the file's entropy.

True, Adler32 can be a little harder to find, but it is in the zlib library if you are using a language like C or C++ which can link to it, or Python which has a module to interact with the lib. You may also be able to find an executable which does the zlib linkage for you (or reimplements the algorithm). In my experience, MD5 is only marginally faster than SHA1. — Zack Bloom, Nov 08 '10 at 21:27

score 2 · Answer 3 · edited May 23 '17 at 11:53

2

Yes, the proposed approach is reasonable and SHA-1 or MD5 will be enough for that task. Here's a detailed analysis for the very same scenario and here's a question specifically on using MD5. Don't forget you need a hash function as fast as possible.

edited May 23 '17 at 11:53

Community

1
1

answered Nov 08 '10 at 12:31

sharptooth

167,383
100
513
979

score 2 · Answer 4 · answered Nov 08 '10 at 12:32

2

Yes, hashing is the first that comes to mind. For your particular task you need to take the fastest hash function available. Adler32 would work. Collisions are not a problem in your case, so you don't need cryptographically strong function.

answered Nov 08 '10 at 12:32

Eugene Mayevski 'Callback

45,135
8
71
121

How to find all files with the same content?

4 Answers4