I have approximately 600GB of photos collected over 13 years, now stored on a FreeBSD/ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), spread over several deep subdirectories - in short: a TERRIBLE MESS with many duplicates.
So as a first pass I:
- searched the tree for files with the same size (fast) and computed an MD5 checksum for those
- collected the duplicate images (same size + same MD5 = duplicate; a rough sketch of this pass is below)
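For reference, this first pass was roughly equivalent to the following (untested sketch; /tank/photos stands in for my real tree):

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my %by_size;
    # Group by size first: only files sharing a size can be byte-identical.
    find(sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f }, '/tank/photos');

    for my $files (grep { @$_ > 1 } values %by_size) {
        my %by_md5;
        for my $f (@$files) {
            open my $fh, '<', $f or next;
            binmode $fh;
            # MD5 only the size collisions - much cheaper than hashing everything.
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $f;
        }
        print join("\n", @$_), "\n\n" for grep { @$_ > 1 } values %by_md5;
    }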
This helped a lot, but there are still MANY MANY duplicates:
- photos that differ only in the EXIF/IPTC data added by some photo management software, but the image itself is the same (or at least "looks the same" and has the same dimensions)
- or they are only resized versions of the original image
- or they are "enhanced" versions of the originals, etc.
Now the questions:
- How can I find duplicates by checksumming only the "pure image bytes" of a JPG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in their EXIF tags while the image is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated - but I need some direction. See my sketch below.
- Which Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
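My current thinking for the pixel-only checksum (untested sketch; I'm assuming that PerlMagick's "signature" attribute is computed over the decoded pixel data only, so metadata-only differences don't change it - alternatively one could strip the metadata first with "exiftool -all=" and checksum the result):

    use strict;
    use warnings;
    use Image::Magick;

    # Checksum of the decoded pixels, ignoring EXIF/IPTC segments.
    sub pixel_signature {
        my ($file) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($file);
        return undef if $err;            # skip unreadable/corrupt files
        return $img->Get('signature');   # digest of the pixel stream
    }

    print pixel_signature($ARGV[0]), "\n";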
More complex:
- How can I find "similar" images that are only:
  - resized versions of the originals
  - "enhanced" versions of the originals (from some photo manipulation program)
- Is there already an algorithm available as a Unix command or Perl module (XS?) that I can use to detect these special "duplicates"? (A rough idea of what I mean is sketched below.)
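From what I have read so far, "perceptual hashing" (aHash/dHash/pHash) might be the right direction: shrink the image to e.g. 8x8 grayscale and hash the pixels relative to their mean, so resizing and mild enhancement change only a few bits. The findimagedupes tool (a Perl script) does something along these lines. My own untested average-hash sketch with Image::Magick (the threshold would need tuning, and I'm not 100% sure about GetPixel's return values for grayscale images):

    use strict;
    use warnings;
    use Image::Magick;

    # Average hash: resized/mildly "enhanced" copies should differ in few bits.
    sub ahash {
        my ($file) = @_;
        my $img = Image::Magick->new;
        return undef if $img->Read($file);
        $img->Resize(width => 8, height => 8);    # throw away detail
        $img->Quantize(colorspace => 'gray');     # throw away color
        my @vals;
        my $sum = 0;
        for my $y (0 .. 7) {
            for my $x (0 .. 7) {
                my ($v) = $img->GetPixel(x => $x, y => $y);  # normalized 0..1
                push @vals, $v;
                $sum += $v;
            }
        }
        my $mean = $sum / 64;
        return join '', map { $_ > $mean ? 1 : 0 } @vals;    # 64-char bit string
    }

    # Number of differing bits; a small distance (say <= 5) suggests "same image".
    sub hamming {
        my $xor = $_[0] ^ $_[1];
        return $xor =~ tr/\0//c;
    }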
I can write complex scripts in BASH and "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600GB over the LAN is not the fastest way)...
My rough idea:
- delete images only at the end of the workflow
- use Image::ExifTool to collect duplicate-candidate groups based on image creation date and camera model (maybe other EXIF data too) - see the sketch after this list
- make a checksum of the pure image data within each group (or extract a histogram - identical images should have identical histograms) - not sure about this
- use some similarity detection to find the duplicates that are resized or photo-enhanced - no idea how to do this...
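For the ExifTool grouping step I imagine something like this (untested; /tank/photos is again a placeholder) - afterwards, only files inside the same bucket need the expensive pixel checksum or similarity comparison:

    use strict;
    use warnings;
    use File::Find;
    use Image::ExifTool qw(ImageInfo);

    my %bucket;
    find(sub {
        return unless -f && /\.jpe?g$/i;
        my $info = ImageInfo($File::Find::name, 'DateTimeOriginal', 'Model');
        # Files shot at the same moment with the same camera are candidates.
        my $key = join '|', $info->{DateTimeOriginal} // '?',
                            $info->{Model}            // '?';
        push @{ $bucket{$key} }, $File::Find::name;
    }, '/tank/photos');

    # Only buckets holding more than one file can contain duplicates.
    for my $key (grep { @{ $bucket{$_} } > 1 } sort keys %bucket) {
        print "$key\n  ", join("\n  ", @{ $bucket{$key} }), "\n";
    }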
Any idea, help, or (software/algorithm) hint on how to bring order into the chaos?
PS:
There is a nearly identical question here: Finding Duplicate image files - but I'm already done with that answer (MD5) and am looking for more precise checksumming and image-comparison algorithms.