I have approximately 600GB of photos collected over 13 years, now stored on a FreeBSD/ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), spread over several deep subdirectories - in short: a TERRIBLE MESS with many duplicates.
So as a first pass I:
- searched the tree for files with the same size (fast) and computed an MD5 checksum for those
- collected the duplicate images (same size + same MD5 = duplicate; a rough sketch of this pass is below)
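For reference, this first pass was roughly equivalent to the following (untested sketch; /tank/photos stands in for my real tree):

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my %by_size;
    # Group by size first: only files sharing a size can be byte-identical.
    find(sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f }, '/tank/photos');

    for my $files (grep { @$_ > 1 } values %by_size) {
        my %by_md5;
        for my $f (@$files) {
            open my $fh, '<', $f or next;
            binmode $fh;
            # MD5 only the size collisions - much cheaper than hashing everything.
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $f;
        }
        print join("\n", @$_), "\n\n" for grep { @$_ > 1 } values %by_md5;
    }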
This helped a lot, but there are still MANY MANY duplicates:
- photos that differ only in the EXIF/IPTC data added by some photo management software, but the image itself is the same (or at least "looks the same" and has the same dimensions)
- or they are only resized versions of the original image
- or they are "enhanced" versions of the originals, etc.
Now the questions:
- How can I find duplicates by checksumming only the "pure image bytes" of a JPG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in their EXIF tags while the image is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated - but I need some direction. See my sketch below.
- Which Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
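My current thinking for the pixel-only checksum (untested sketch; I'm assuming that PerlMagick's "signature" attribute is computed over the decoded pixel data only, so metadata-only differences don't change it - alternatively one could strip the metadata first with "exiftool -all=" and checksum the result):

    use strict;
    use warnings;
    use Image::Magick;

    # Checksum of the decoded pixels, ignoring EXIF/IPTC segments.
    sub pixel_signature {
        my ($file) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($file);
        return undef if $err;            # skip unreadable/corrupt files
        return $img->Get('signature');   # digest of the pixel stream
    }

    print pixel_signature($ARGV[0]), "\n";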
More complex:
- How can I find "similar" images that are only:
  - resized versions of the originals
  - "enhanced" versions of the originals (from some photo manipulation program)
- Is there already an algorithm available as a Unix command or Perl module (XS?) that I can use to detect these special "duplicates"? (A rough idea of what I mean is sketched below.)
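From what I have read so far, "perceptual hashing" (aHash/dHash/pHash) might be the right direction: shrink the image to e.g. 8x8 grayscale and hash the pixels relative to their mean, so resizing and mild enhancement change only a few bits. The findimagedupes tool (a Perl script) does something along these lines. My own untested average-hash sketch with Image::Magick (the threshold would need tuning, and I'm not 100% sure about GetPixel's return values for grayscale images):

    use strict;
    use warnings;
    use Image::Magick;

    # Average hash: resized/mildly "enhanced" copies should differ in few bits.
    sub ahash {
        my ($file) = @_;
        my $img = Image::Magick->new;
        return undef if $img->Read($file);
        $img->Resize(width => 8, height => 8);    # throw away detail
        $img->Quantize(colorspace => 'gray');     # throw away color
        my @vals;
        my $sum = 0;
        for my $y (0 .. 7) {
            for my $x (0 .. 7) {
                my ($v) = $img->GetPixel(x => $x, y => $y);  # normalized 0..1
                push @vals, $v;
                $sum += $v;
            }
        }
        my $mean = $sum / 64;
        return join '', map { $_ > $mean ? 1 : 0 } @vals;    # 64-char bit string
    }

    # Number of differing bits; a small distance (say <= 5) suggests "same image".
    sub hamming {
        my $xor = $_[0] ^ $_[1];
        return $xor =~ tr/\0//c;
    }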
I can write complex scripts in BASH and "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600GB over the LAN is not the fastest way)...
My rough idea:
- delete images only at the end of the workflow
- use Image::ExifTool to collect duplicate-candidate groups based on image creation date and camera model (maybe other EXIF data too) - see the sketch after this list
- make a checksum of the pure image data within each group (or extract a histogram - identical images should have identical histograms) - not sure about this
- use some similarity detection to find the duplicates that are resized or photo-enhanced - no idea how to do this...
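For the ExifTool grouping step I imagine something like this (untested; /tank/photos is again a placeholder) - afterwards, only files inside the same bucket need the expensive pixel checksum or similarity comparison:

    use strict;
    use warnings;
    use File::Find;
    use Image::ExifTool qw(ImageInfo);

    my %bucket;
    find(sub {
        return unless -f && /\.jpe?g$/i;
        my $info = ImageInfo($File::Find::name, 'DateTimeOriginal', 'Model');
        # Files shot at the same moment with the same camera are candidates.
        my $key = join '|', $info->{DateTimeOriginal} // '?',
                            $info->{Model}            // '?';
        push @{ $bucket{$key} }, $File::Find::name;
    }, '/tank/photos');

    # Only buckets holding more than one file can contain duplicates.
    for my $key (grep { @{ $bucket{$_} } > 1 } sort keys %bucket) {
        print "$key\n  ", join("\n  ", @{ $bucket{$key} }), "\n";
    }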
Any idea, help, or (software/algorithm) hint on how to bring order into the chaos?
PS:
There is a nearly identical question here: Finding Duplicate image files - but I'm already done with that answer (MD5) and am looking for more precise checksumming and image-comparison algorithms.