
I want to detect duplicate files in storage by computing a hash for each of them. However, some files could be large, and retrieving the full content of a file just to hash it would be expensive. The algorithm could be MD5, and the files could be video, audio, or images. Is it a good idea to compute the hash over only a small first part of such large files, e.g. the first 1 MB?

  • Does this answer your question? [Getting a File's MD5 Checksum in Java](https://stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java) – Shawn Feb 08 '22 at 10:12
  • Basically: just hash the entire file. There are lots of ways to do it efficiently [see the streaming sketch after these comments]. – Shawn Feb 08 '22 at 10:16
  • Opening every file to checksum it may be slow on large filesystems, but if you start by matching file sizes first, you only need checksums or file-with-file comparisons on the much smaller group of files whose lengths are identical. This is particularly effective on directories of irregular files (e.g. image/audio), where length matches are uncommon, and it saves reading thousands of files. – DuncG Feb 08 '22 at 10:35
  • Use SHA-512 and be done. The proposed duplicate covers other hashes too, so dupe! – kelalaka Feb 08 '22 at 21:36
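Following up on the comments above, here is a minimal Java sketch of hashing an entire file by streaming it through `MessageDigest` in fixed-size chunks, so memory use stays constant no matter how large the file is. The class and method names are illustrative, and `HexFormat` requires Java 17+.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class FullFileHash {

    // Streams the file through MD5 in 8 KB chunks, so only a small
    // buffer is ever held in memory, regardless of file size.
    static String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```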

1 Answer


You can compare the file size first and then hash only the first 1 MB of the file as a cheap pre-filter. Only when both the size and the partial hash match do you need to read and compare the full contents to confirm a duplicate.
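A rough Java sketch of that idea, assuming the grouping key is the pair (file size, MD5 of the first 1 MB); the names `prefixMd5` and `groupKey` are hypothetical. Files that share a key are only candidate duplicates and still need a full-content comparison to confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class PartialHash {
    private static final int PREFIX_BYTES = 1024 * 1024; // hash only the first 1 MB

    // Hashes at most the first PREFIX_BYTES of the file. This is a cheap
    // pre-filter, not proof of equality: files that share the prefix hash
    // must still be compared in full (full hash or byte-by-byte).
    static String prefixMd5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int remaining = PREFIX_BYTES;
            int read;
            while (remaining > 0
                    && (read = in.read(buffer, 0, Math.min(buffer.length, remaining))) != -1) {
                md.update(buffer, 0, read);
                remaining -= read;
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    // Candidate duplicates are files whose (size, prefix hash) pair matches.
    static String groupKey(Path file) throws IOException, NoSuchAlgorithmException {
        return Files.size(file) + ":" + prefixMd5(file);
    }
}
```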

answered by Icodeeeee; edited by ouflak
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 08 '22 at 11:29