I have a folder containing around 20,000 images. I want to delete the images that are exact duplicates. My plan is to calculate the MD5 hash of each image and then remove the images that have the same hash value. However, comparing the hash of each image against those of the other 20,000 images takes a lot of time. Please suggest an optimized solution so that I can make this process faster and remove the duplicate images.
-
Collect the hashes and associated filenames, sort by hash, then iterate over the list, removing duplicates. – Roger Rowland Oct 09 '15 at 07:03
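A minimal Python sketch of the sort-then-scan idea from this comment. The function names `md5_of_file` and `find_duplicates` are made up for illustration, and the sketch assumes all images sit directly in one folder. Each file is hashed once, so the work is dominated by reading the files rather than by pairwise comparisons.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return paths whose hash equals that of an earlier file in sorted order."""
    # Hash every file once, then sort so that equal hashes become neighbours.
    pairs = sorted(
        (md5_of_file(p), str(p)) for p in Path(folder).iterdir() if p.is_file()
    )
    duplicates = []
    previous_hash = None
    for file_hash, name in pairs:
        if file_hash == previous_hash:
            duplicates.append(name)  # same hash as the file kept just before it
        else:
            previous_hash = file_hash  # first file with this hash is kept
    return duplicates
```

The returned paths could then be passed to `os.remove`, ideally after printing them once as a dry run.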
-
Exactly as @RogerRowland suggests, but calculate the hashes in parallel first with `GNU Parallel`, something like `parallel md5sum {} ::: *.jpg > hashes.txt`, depending on your OS. – Mark Setchell Oct 09 '15 at 07:06
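If GNU Parallel is not available, a similar effect can be sketched in Python with a thread pool. This swaps in `concurrent.futures.ThreadPoolExecutor` for GNU Parallel and is not what the comment itself describes. The names `md5_of_file` and `hash_files_parallel` and the default of 8 workers are assumptions chosen for illustration; the best worker count depends on the disk and CPU.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path):
    """Return the MD5 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files_parallel(paths, workers=8):
    """Hash many files concurrently; return a list of (hash, path) pairs."""
    paths = list(paths)
    # File reads and hashing of large buffers release the GIL, so threads
    # can overlap I/O and hashing without needing separate processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(md5_of_file, paths))  # map preserves input order
    return list(zip(hashes, paths))
```

Sorting the returned pairs by hash then gives the same list that `hashes.txt` would contain.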
-
You may also like to look here http://stackoverflow.com/a/28834788/2836621 and here http://stackoverflow.com/a/22724295/2836621 – Mark Setchell Oct 09 '15 at 07:25
-
If file sizes may differ, sort by size first, and compare hashes only within groups of the same size. – MBo Oct 09 '15 at 08:41
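A rough Python sketch of this size-first filter. The function name `duplicates_by_size_then_hash` is hypothetical, and the sketch assumes each image is small enough to read into memory in one go. Files with a unique size cannot have an exact duplicate, so they are never hashed at all.

```python
import hashlib
import os
from collections import defaultdict


def duplicates_by_size_then_hash(folder):
    """Return paths that duplicate an earlier file of the same size and hash."""
    # Step 1: group by size, which only needs file metadata, not file contents.
    by_size = defaultdict(list)
    for entry in os.scandir(folder):
        if entry.is_file():
            by_size[entry.stat().st_size].append(entry.path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        # Step 2: hash only the files that share their size with another file.
        seen = set()
        for path in sorted(paths):
            with open(path, "rb") as f:
                file_hash = hashlib.md5(f.read()).hexdigest()
            if file_hash in seen:
                duplicates.append(path)
            else:
                seen.add(file_hash)
    return duplicates
```

How much this saves depends on the data: if most images have distinct sizes, very few files need to be read.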