I have around 1 TB of images stored on my hard disk. These are pictures taken over time of friends and family. Many of these pictures are duplicates, in the sense that the same file is saved in different locations, probably with a different name too. Is there any tool, utility, or approach (I could code one) to find the duplicate files?
1 Answer
6
I would recommend using md5deep or sha1deep. On Linux, simply install the md5deep package (it is included in most Linux distributions).
Once you have it installed, run it in recursive mode over your whole disk and save the checksum of every file into a text file with a command like this:
md5deep -r -l . > filelist.txt
If you like sha1 better than md5, use sha1deep instead (it is part of the same package).
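For example, the sha1deep equivalent of the command above would look like this:
sha1deep -r -l . > filelist.txt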
Once you have the file, sort it using sort (or pipe it into sort in the previous step):
sort < filelist.txt > filelist_sorted.txt
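Equivalently, you can combine the hashing and sorting into a single pipeline, for example:
md5deep -r -l . | sort > filelist_sorted.txt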
Now, simply look at the result in any text editor; you will quickly see all the duplicates along with their locations on disk.
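If you would rather not scan by eye, you can let uniq group the repeated checksums for you (a quick sketch assuming GNU uniq and md5 checksums, which occupy the first 32 characters of every line):
uniq -w 32 --all-repeated=separate filelist_sorted.txt > duplicates.txt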
If you are so inclined, you can write a simple script in Perl or Python to remove the duplicates based on this file list.
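Here is a minimal Python sketch of that idea. It assumes md5deep's default output format (checksum, whitespace, path) and the filelist_sorted.txt produced above; it only prints the duplicate copies it finds, so uncomment the os.remove line if you really want it to delete them:

import os

# Read the checksum list and collect every path whose checksum was already seen.
# Each line is expected to look like: <checksum>  <path>
seen = {}          # checksum -> first path seen with that checksum
duplicates = []    # later paths that share a checksum with an earlier file

with open("filelist_sorted.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        checksum, path = line.split(None, 1)
        if checksum in seen:
            duplicates.append(path)
        else:
            seen[checksum] = path

for path in duplicates:
    print("duplicate:", path)
    # os.remove(path)   # uncomment to actually delete the duplicate copy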

mvp
- Is there something available on Windows? – abhinav Mar 06 '13 at 05:42
- Also, just curious: would this be a good example for trying some map-reduce code if the image data grew to a much larger volume? – abhinav Mar 06 '13 at 05:43
- I guess why not: you can distribute the hashing CPU load over many hosts. – mvp Mar 06 '13 at 05:44