
I want to write a program for personal use that walks the file tree of all of my volumes for the purpose of finding duplicate files. I know there are programs out there that do this, but none of them work the way I want, and few seem to employ file hashing as an accuracy check, probably because hashing takes time.

While I walk the file trees, I will be storing three pieces of information in a MySQL database:

  • Full file path
  • File Size
  • Hash Signature

That is because, for my purposes, a file will be considered a duplicate only if all of these conditions are met:

  • The file name is the same
  • The file size is the same
  • The hash signature is the same.

Given that the first two conditions are true, condition three does NOT need an especially accurate (collision-resistant) hashing algorithm.

Once the tree walks are done, I will search the database for matching file hashes and then check the other conditions...
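
To illustrate the tree-walk step, here is a minimal sketch using java.nio.file.Files.walkFileTree; it only prints the path and size (the MySQL insert and the hash are left out), and the root path is a placeholder:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    public class TreeWalkSketch {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get("/"); // placeholder: one of the volumes to scan
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // In the real program the path, size (and later a hash/checksum)
                    // would be inserted into the database instead of printed.
                    System.out.println(file.toAbsolutePath() + " | " + attrs.size());
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip files that can't be read (permissions, broken links, etc.).
                    return FileVisitResult.CONTINUE;
                }
            });
        }
    }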

I know that MD5 seems to be the de facto standard for generating unique file hash signatures, but it is costly in terms of time, and in my project I will be generating a hash signature for millions of files and don't want to wait several days for the process to finish.

So based on my requirements, what would be the fastest way to generate a file hash signature in Java that would be good enough to use as a final validation that the two files are indeed duplicates?

Thank you

Update: After some thought and the discussion below, I've decided to slightly alter my method so that I only perform a deeper comparison between files after the first two conditions are met. That means I'll walk the tree and create the database entries first, then do the deeper computation only once the filename and the size are equal, and I'll be exploring a checksum method as opposed to hashing.
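
For reference, a minimal sketch of the checksum idea using java.util.zip.CRC32 (Adler32 would be a drop-in alternative); the file paths are placeholders and error handling is kept to a minimum:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.zip.CRC32;

    public class ChecksumSketch {
        // Streams the file through a CRC32 and returns the 32-bit checksum value.
        static long crc32Of(Path file) throws IOException {
            CRC32 crc = new CRC32();
            byte[] buffer = new byte[64 * 1024];
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    crc.update(buffer, 0, read);
                }
            }
            return crc.getValue();
        }

        public static void main(String[] args) throws IOException {
            // Placeholder paths: a candidate pair whose name and size already match.
            Path a = Paths.get("candidateA.bin");
            Path b = Paths.get("candidateB.bin");
            System.out.println("Checksums match: " + (crc32Of(a) == crc32Of(b)));
        }
    }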

Michael Sims
  • When you say the hash doesn't have to be accurate, do you mean you can tolerate false positives or false negatives? – shmosel Jun 21 '22 at 20:15
  • Have you checked this: https://stackoverflow.com/questions/5632604/fastest-hash-function-implemented-in-java-comparing-part-of-file They use a checksum. – wake-0 Jun 21 '22 at 20:16
  • Why not hash only files that turn out to be having the same name and same size? It might save a lot of time if there aren't many duplicates to begin with. – akarnokd Jun 21 '22 at 20:18
  • @shmosel - My thinking in terms of "accuracy" was that an algorithm like MD5 strives to produce a truly unique signature for any given file, which I assume makes that method more costly to use in terms of clock cycles... and given that in my case the filename and the file size are the same, I assume that the same level of "accuracy" in the hash signature would not require extreme detail - but as I write this, my thinking might be incorrect, since the files could be very similar; perhaps that kind of detail would be necessary... I'm not entirely sure about this. – Michael Sims Jun 21 '22 at 21:00
  • @akarnokd - It's funny that you made this comment because literally two seconds after clicking submit on this question, the thought occurred to me to just record the tree in the database without hashing, then only hash the files when the first two conditions are met ... it would still benefit from the fastest method, however, when it comes to hashing the files. – Michael Sims Jun 21 '22 at 21:01
  • I don't know that there's a correlation between hash uniqueness and processing time. – shmosel Jun 21 '22 at 21:03
  • @KevinWallis - I like the checksum idea ... I think I'm going to explore that method. – Michael Sims Jun 21 '22 at 21:07

1 Answer


I have recently been researching a similar problem and ended up with a similar set of conditions. I decided to try MurmurHash3, as it seems purpose-built for this application. It is not cryptographically secure, which is not needed in this scenario, but it seems to be very lightweight.

Apache has an implementation in their commons-codec package.
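
For example, a rough sketch assuming commons-codec 1.14 or later on the classpath; it reads the whole file into memory for brevity, so very large files would need a chunked variant:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.codec.digest.MurmurHash3;

    public class MurmurSketch {
        public static void main(String[] args) throws IOException {
            // Placeholder path: in practice this would be a file flagged as a
            // possible duplicate by the name/size check.
            byte[] data = Files.readAllBytes(Paths.get("candidate.bin"));

            // 128-bit MurmurHash3 (x64 variant), returned as two longs.
            long[] hash = MurmurHash3.hash128x64(data);
            String signature = Long.toHexString(hash[0]) + Long.toHexString(hash[1]);
            System.out.println(signature);
        }
    }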

SephB
  • I have a feeling that the I/O cost of reading the file would make the choice of algorithm insignificant. But maybe there's a way to sample bytes instead of reading the whole thing. – shmosel Jun 21 '22 at 20:16
  • I agree; hashing the first block of the file may be enough to get a unique hash, depending on the file type. – SephB Jun 21 '22 at 20:19
  • In my use case I have to read the file to perform multiple QC checks on it anyway, so the read cost is already incurred. – SephB Jun 21 '22 at 20:20
  • @SephB - When you say it might be just as accurate to hash the first block of the file depending on the type, what kinds of different types would make that method potentially inaccurate, and also if you're just reading the first block of a file, wouldn't it be faster to simply compare the blocks against one another instead of hashing them? – Michael Sims Jul 05 '22 at 22:46
  • @MichaelSims Many file types start with fixed content at the start of the file; each byte of fixed content decreases the number of variable bytes in the first block, essentially decreasing the amount of data you are hashing. Yes, it is always faster to not hash than to hash if you are reading the files at the same time. If you are processing the files at different times, storing and loading a hash saves having to read the file again. – SephB Jul 07 '22 at 15:18
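
To illustrate the direct first-block comparison discussed above, here is a minimal sketch; the block size and file paths are placeholders, and it assumes Java 9+ for InputStream.readNBytes:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class FirstBlockCompare {
        // Reads up to blockSize bytes from the start of each file and compares them.
        static boolean firstBlocksEqual(Path a, Path b, int blockSize) throws IOException {
            try (InputStream inA = Files.newInputStream(a);
                 InputStream inB = Files.newInputStream(b)) {
                byte[] blockA = inA.readNBytes(blockSize);
                byte[] blockB = inB.readNBytes(blockSize);
                return Arrays.equals(blockA, blockB);
            }
        }

        public static void main(String[] args) throws IOException {
            // Placeholder paths for a candidate pair with matching name and size.
            Path a = Paths.get("candidateA.bin");
            Path b = Paths.get("candidateB.bin");
            System.out.println(firstBlocksEqual(a, b, 64 * 1024));
        }
    }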