4

I want to find duplicate files on the file system in C++. Is there any algorithm to do that as fast as possible? And do I need to create a multi-threaded application, or can I just use one thread to do it?

Eric Leschinski
unresolved_external
  • Usually only 1 thread does the IO, since disk access cannot be simultaneous on an HDD (not sure about SSDs). To find duplicates, compare hashes, usually MD5 or SHA1. – nhahtdh Aug 01 '12 at 13:36
  • Related question: http://stackoverflow.com/questions/784585/what-is-the-fastest-way-to-check-if-files-are-identical – Andriy Aug 01 '12 at 13:37
  • Can you elaborate a bit more? Do you want to search an entire disk to see if any file in any directory is a duplicate of any other file on the drive? Do you need help with the filesystem access, the algorithm, or both? – Mark B Aug 01 '12 at 13:38
  • Do you want to search an entire disk to see if any file in any directory is a duplicate of any other file on the drive? Correct. Do you need help with the filesystem access, the algorithm, or both? I need help with the algorithm, but if you can help me with filesystem access I would appreciate it :) – unresolved_external Aug 01 '12 at 13:41

3 Answers

11

I concur with Kerrek SB that there are better tools for this than C++. However, assuming you really need to do this in C++, here are some suggestions and things to consider in your implementation:

  1. use boost::filesystem for portable filesystem traversal

  2. the hash-every-file suggestion is very reasonable, but it might be more efficient to first build a multimap where the file size is the key, and then only apply the hash when there are files of duplicate size (see the sketch after this list).

  3. decide how you want to treat empty files and symbolic links/shortcuts

  4. decide how you want to treat special files; e.g. on Unix you have directories, FIFOs, sockets, etc.

  5. account for the fact that files or directory structure may change, disappear or move while your algorithm is running

  6. account for the fact that some files or directories may be inaccessible or broken (e.g. recursive directory links)

  7. make the number of threads configurable, as the amount of parallelization that makes sense depends on the underlying disk hardware and configuration; it will be different on a simple hard drive than on an expensive SAN. Don't make assumptions, though; test it out. For instance, Linux is very good about caching files, so many of your reads will come from memory and thus not block on I/O.
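
A rough sketch of points 1, 2, 4 and 6 (illustration only, not a drop-in implementation; the function names are just for this example) using boost::filesystem could look like this; the hashing itself is left as a comment since any digest routine will do:

```cpp
// Sketch only: walk a tree with boost::filesystem and group regular files by size.
// Only the size groups with more than one member need to be hashed afterwards.
#include <boost/cstdint.hpp>
#include <boost/filesystem.hpp>
#include <iostream>
#include <iterator>
#include <map>
#include <utility>

namespace fs = boost::filesystem;
typedef std::multimap<boost::uintmax_t, fs::path> SizeMap;

SizeMap group_by_size(const fs::path& root)
{
    SizeMap by_size;
    fs::recursive_directory_iterator it(root), end;
    while (it != end) {
        try {
            // Points 3 and 4: symlink_status() does not follow links, so symlinks,
            // directories, fifos and sockets are all skipped here.
            if (fs::is_regular_file(it->symlink_status()))
                by_size.insert(std::make_pair(fs::file_size(it->path()), it->path()));
        } catch (const fs::filesystem_error& e) {
            // Point 6: unreadable or vanished entries are reported and skipped.
            std::cerr << "skipping: " << e.what() << '\n';
        }
        ++it; // NB: the increment itself can throw on unreadable directories;
              // a robust version would use the error_code overload instead.
    }
    return by_size;
}

int main(int argc, char* argv[])
{
    if (argc < 2) return 1;
    SizeMap by_size = group_by_size(argv[1]);

    // Point 2: only size groups with more than one member are worth hashing.
    for (SizeMap::iterator i = by_size.begin(); i != by_size.end(); ) {
        std::pair<SizeMap::iterator, SizeMap::iterator> group = by_size.equal_range(i->first);
        if (std::distance(group.first, group.second) > 1) {
            for (SizeMap::iterator f = group.first; f != group.second; ++f)
                std::cout << "candidate (" << i->first << " bytes): " << f->second << '\n';
            // compute a digest for each candidate here and compare within the group
        }
        i = group.second;
    }
}
```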

frankc
  • +1 - going by file size first is definitely a better approach! – Kerrek SB Aug 01 '12 at 13:55
  • What is the best way to compute the hash of a file, via the OpenSSL library? And can such a hash function be applied to all files, or only to txt files? – unresolved_external Aug 02 '12 at 19:44
  • Yes, that is probably fine. You can see here for how to compute md5sums: http://stackoverflow.com/questions/1220046/in-c-how-to-get-md5-hash-of-a-file You don't need a cryptographically secure hash for this application, unless someone intentionally fooling your scanner would cause security problems (e.g. if you are automatically deleting duplicates). – frankc Aug 03 '12 at 14:02
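
Regarding the OpenSSL question in the comments above, a rough sketch of computing a file's MD5 with OpenSSL's EVP interface is shown below (md5_of_file is just an illustrative helper, not a library function). It reads raw bytes, so it applies to any file type (text, .avi, .mp3, ...), not only text files:

```cpp
// Rough sketch: MD5 of a file's raw bytes via OpenSSL's EVP interface.
#include <openssl/evp.h>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

std::string md5_of_file(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();          // EVP_MD_CTX_create() on older OpenSSL
    EVP_DigestInit_ex(ctx, EVP_md5(), NULL);

    std::vector<char> buf(1 << 16);              // 64 KiB reads keep memory use flat
    while (in.good()) {
        in.read(&buf[0], buf.size());
        std::streamsize got = in.gcount();
        if (got > 0)
            EVP_DigestUpdate(ctx, &buf[0], static_cast<size_t>(got));
    }

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);                        // EVP_MD_CTX_destroy() on older OpenSSL

    std::ostringstream hex;
    for (unsigned int i = 0; i < len; ++i)
        hex << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(digest[i]);
    return hex.str();
}
```
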
9

1) Don't use C++. All the tools you need already exist.

2) Hash every file (e.g. with md5sum) and build an index of filenames, file sizes and hash values.*

3) Sort by hash value and look for duplicate pairs of hash value and size (e.g. with sort).

4) Do an ordinary diff on the candidate duplicates.

You can parallelize step 2) with a bit of work, but you'll be limited by the I/O speed of your storage. You can parallelize step 3) by splitting your large index file into bits, sorting them separately and then merging them (sort -m).

*) As @frankc says, don't actually hash every file, but only those whose sizes aren't unique. Start with the size-based index. You'll have to hash a lot of small files, but only very few large files.

Kerrek SB
  • I would like not to use it, but it is a part of the task ;). So I need one single thread, am I correct? – unresolved_external Aug 01 '12 at 13:38
  • Not sure how step 2 can be parallelized. Most of the time is spent on IO anyway. – nhahtdh Aug 01 '12 at 13:39
  • And by the way, can I hash non-text files (e.g. *.avi or *.mp3)? – unresolved_external Aug 01 '12 at 13:39
  • @nhahtdh: Yes, you're right, I had wanted to mention that I/O will be the bottleneck in any case. – Kerrek SB Aug 01 '12 at 13:41
  • "Sort by" ehmm... simply do the indexing by the hash value (or hash value + file size) (eg: dictionary of lists), so you don't have to sort. – Karoly Horvath Aug 01 '12 at 13:55
  • As @frankc says, you should probably make an index of file *sizes* first and only compute the hash for those candidates. – Kerrek SB Aug 01 '12 at 13:56
  • @nhahtdh yes, IO is a bottleneck, but don't assume the operations are totally sequential. A lot of I/O will come from the disk cache on Linux, plus the underlying disk setup may have RAID, multiple controllers, etc. – frankc Aug 01 '12 at 18:03
4

I would do this:

  • scan the directories you are interested in, looking at each file's size; store the pair file size/path in a multimap, with the file size as index;
  • then, scan the multimap for buckets with just one element per key, i.e. files whose size is unique; those surely cannot be duplicates.
  • hash the content of the remaining files, and do the same stuff as before (multimap with hashes as keys and paths as values).
  • then perform a real (byte per byte) comparison only of the files that have the same hash.

This process should be much faster than blindly hashing all the files, since most files have different sizes and can be told apart just by looking at that; and checking the file size is much cheaper than hashing files, since it's just a filesystem attribute lookup instead of reading the whole content of the file.

The final step is needed because there's the possibility of different files with the same hash; but with good hashing functions most of the work has already been done, since hash collisions for unrelated files should be really rare.
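
That last check can be as simple as the sketch below (same_bytes is a hypothetical helper shown only for illustration; the candidates already have equal size, so a plain stream comparison is enough):

```cpp
// Sketch of the final step: confirm that two same-hash files really are identical.
#include <fstream>
#include <iterator>
#include <string>

bool same_bytes(const std::string& a, const std::string& b)
{
    std::ifstream fa(a.c_str(), std::ios::binary);
    std::ifstream fb(b.c_str(), std::ios::binary);
    if (!fa || !fb) return false;                 // unreadable -> treat as "not a duplicate"

    std::istreambuf_iterator<char> ia(fa), ib(fb), end;
    while (ia != end && ib != end) {
        if (*ia != *ib) return false;
        ++ia; ++ib;
    }
    return ia == end && ib == end;                // equal only if both streams ended together
}
```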

Notice that there's no need for your hash function to be cryptographically secure, nor particularly fast (I suppose that the time of this process would be dominated by IO).

Also, since you don't actually need to have a sorted container, instead of the multimap you could use an unordered_multimap, since it should have faster lookup times and, once you know how many files you have to deal with, you may call reserve with a definite maximum number of elements, avoiding reallocations.
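
For instance (a small hypothetical helper, assuming each digest is kept as a hex string next to its path):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Bucket candidate paths by content hash; ordering is irrelevant, so an
// unordered_multimap suffices, and reserve() avoids rehashing while inserting.
std::unordered_multimap<std::string, std::string>
bucket_by_hash(const std::vector<std::pair<std::string, std::string> >& hashed) // (hash, path) pairs
{
    std::unordered_multimap<std::string, std::string> by_hash;
    by_hash.reserve(hashed.size());   // we already know how many candidates there are
    for (std::size_t i = 0; i < hashed.size(); ++i)
        by_hash.insert(hashed[i]);
    return by_hash;
}
```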

Matteo Italia