
I want to detect duplicate files in storage by computing a hash for each of them. However, some files could be large, and retrieving the full content of a file just to hash it would be expensive. The algorithm could be MD5, and the files could be video, audio, or images. Is it a good idea to compute the hash over only a small first part of such large files, e.g. the first 1 MB?

  • Does this answer your question? [Getting a File's MD5 Checksum in Java](https://stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java) – Shawn Feb 08 '22 at 10:12
  • Basically: just hash the entire file. There are lots of ways to do it efficiently [see the streaming sketch after these comments]. – Shawn Feb 08 '22 at 10:16
  • Opening every file to checksum it may be slow on large filesystems, but if you start by matching file sizes first, you only need checksums or file-with-file comparisons on the much smaller group of files whose lengths are identical. This is particularly effective on directories of irregular files (e.g. image/audio), where length matches are uncommon, and it saves reading thousands of files. – DuncG Feb 08 '22 at 10:35
  • Use SHA-512 and be done. The proposed duplicate covers other hashes too, so dupe! – kelalaka Feb 08 '22 at 21:36
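Following up on the comments above, here is a minimal Java sketch of hashing an entire file by streaming it through `MessageDigest` in fixed-size chunks, so memory use stays constant no matter how large the file is. The class and method names are illustrative, and `HexFormat` requires Java 17+.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class FullFileHash {

    // Streams the file through MD5 in 8 KB chunks, so only a small
    // buffer is ever held in memory, regardless of file size.
    static String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```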

1 Answer


You can compare the file size first and then hash only the first 1 MB of the file as a cheap pre-filter. Only when both the size and the partial hash match do you need to read and compare the full contents to confirm a duplicate.
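A rough Java sketch of that idea, assuming the grouping key is the pair (file size, MD5 of the first 1 MB); the names `prefixMd5` and `groupKey` are hypothetical. Files that share a key are only candidate duplicates and still need a full-content comparison to confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class PartialHash {
    private static final int PREFIX_BYTES = 1024 * 1024; // hash only the first 1 MB

    // Hashes at most the first PREFIX_BYTES of the file. This is a cheap
    // pre-filter, not proof of equality: files that share the prefix hash
    // must still be compared in full (full hash or byte-by-byte).
    static String prefixMd5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int remaining = PREFIX_BYTES;
            int read;
            while (remaining > 0
                    && (read = in.read(buffer, 0, Math.min(buffer.length, remaining))) != -1) {
                md.update(buffer, 0, read);
                remaining -= read;
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    // Candidate duplicates are files whose (size, prefix hash) pair matches.
    static String groupKey(Path file) throws IOException, NoSuchAlgorithmException {
        return Files.size(file) + ":" + prefixMd5(file);
    }
}
```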

answered by Icodeeeee; edited by ouflak
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 08 '22 at 11:29