9

I have about 150,000 pictures and some of them are duplicates. I have figured that the SSIM algorithm is a good choice for comparing two pictures and deciding whether they are duplicates. However, finding duplicates that way would require roughly 150,000 * 149,999 comparisons, which would take forever.

So what I am looking for now is a fast and effective algorithm that computes some kind of average (summary) value for each picture, so that I only need to compare images whose values are close to each other.

In short: I am looking for an effective way to categorize pictures!

I plan to use the C++ CImg library for this task because it's fast.

Thanks!

moccajoghurt
  • Are duplicate pictures really duplicates (same width & height), or can they be resized versions of the original picture? – Vincent Mimoun-Prat Dec 17 '12 at 19:20
  • The pictures can have slight differences and only share the same width; the height can vary. – moccajoghurt Dec 17 '12 at 19:23
  • Isn't there a method that sorts these photos with respect to their width? – Salih Erikci Dec 17 '12 at 19:26
  • Can't you just sort the images by width then? Only images with the same width need to be compared against each others. – jalf Dec 17 '12 at 19:30
  • There are pictures that vary in height but are basically the same picture, just with an unrelated box at the bottom which changes the height. – moccajoghurt Dec 17 '12 at 19:36
  • @jalf: I can imagine that lots of images could have the same width, possibly all of them... My question was more to help choose the type of algorithm that could be used. – Vincent Mimoun-Prat Dec 17 '12 at 19:36
  • Still, if this is a one-shot operation I think the first step would be to look at the distributions of width of the pictures. For those widths which only have a manageable number of images you might as well just run all `n^2` comparisons. If there are any groups left that are too big for that (like, if half the images are the same width), consider doing something that requires actual thought on your part ;-) – Steve Jessop Dec 17 '12 at 19:53
  • @user1910856: Are they "basically" the same picture, or *exactly* the same picture apart from the box at the bottom? I ask because depending on the answer the appropriate technique will be quite different. – Yaniv Dec 17 '12 at 23:32
  • The second comment has clarified that "identical" images are actually slightly different. Also, the use of SSIM should make that clear. If the images were equal except for the height, some simple pre-processing and any hashing would easily solve this. – mmgp Dec 18 '12 at 00:29

4 Answers

3

There are pictures that vary in height but are basically the same picture, just with an unrelated box at the bottom which changes the height.

If the top of the picture is always the same for two duplicates, you could compute a hash value based on the first N lines of pixels that are supposed to be pretty safe (i.e. the box at the bottom won't be in those lines).

Once you have hashed all your files, you can sort the hash values and only run a precise comparison on pictures that share the same hash value.
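For illustration, here is a minimal sketch of that idea, assuming the CImg library mentioned in the question and a simple FNV-1a hash in place of a cryptographic one; the function name and the choice of N are illustrative, not part of this answer:

```cpp
#include "CImg.h"
#include <algorithm>
#include <cstdint>
#include <string>
using namespace cimg_library;

// Hash the raw pixel values of the first n_rows rows of an image (FNV-1a).
std::uint64_t hash_top_rows(const std::string& filename, int n_rows = 20) {
    CImg<unsigned char> img(filename.c_str());
    const int rows = std::min(n_rows, img.height());

    std::uint64_t h = 1469598103934665603ULL;        // FNV-1a offset basis
    for (int c = 0; c < img.spectrum(); ++c)         // all color channels
        for (int y = 0; y < rows; ++y)
            for (int x = 0; x < img.width(); ++x) {
                h ^= img(x, y, 0, c);                // mix in one pixel byte
                h *= 1099511628211ULL;               // FNV-1a prime
            }
    return h;
}
```

Note that this only groups files whose top rows are byte-for-byte identical; any re-encoding of the pixels breaks the equality.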

Vincent Mimoun-Prat
2

I'd try a hash/fingerprint-like approach:

  • Generate a fingerprint for each image, storing it together with relevant image attributes such as size and number of color components in a metafile or a database. The fingerprint could be computed from the common sub-image; it might be a lossy compressed spectrogram, a simple vector containing the frequency bins of an FFT, a histogram, or another technique (I don't have a real clue what fits best, this is most probably very content dependent; see the sketch after this list).

  • As littlestewie mentioned, grouping beforehand by image attributes such as size and number of color components will greatly reduce the number of (binary) comparisons, which would otherwise be (n*(n-1))/2 for each group.

  • Compare the fingerprints with an appropriate tolerance for further sub-grouping (take care to cover the cases where one image has matches in multiple groups).

  • OpenCV could do the final match:

    How to detect the Sun from the space sky in OpenCv?

Related questions regarding image comparison using OpenCV:
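As a concrete, purely hypothetical example of one such fingerprint, here is a sketch of a coarse grayscale-histogram fingerprint computed on a fixed-size thumbnail with CImg; the bin count, thumbnail size, and distance measure are assumptions, not something this answer prescribes:

```cpp
#include "CImg.h"
#include <array>
#include <cmath>
#include <string>
using namespace cimg_library;

constexpr int kBins = 32;

// Normalized grayscale histogram of a 64x64 thumbnail of the image.
std::array<float, kBins> histogram_fingerprint(const std::string& filename) {
    CImg<unsigned char> img(filename.c_str());
    img.resize(64, 64);                              // size-independent thumbnail
    std::array<float, kBins> hist{};
    cimg_forXY(img, x, y) {
        float gray = 0;
        for (int c = 0; c < img.spectrum(); ++c) gray += img(x, y, 0, c);
        gray /= img.spectrum();                      // average channels -> grayscale
        hist[static_cast<int>(gray) * kBins / 256] += 1.0f;
    }
    for (float& v : hist) v /= 64.0f * 64.0f;        // normalize to sum 1
    return hist;
}

// L1 distance between fingerprints; small values suggest the same sub-group.
float fingerprint_distance(const std::array<float, kBins>& a,
                           const std::array<float, kBins>& b) {
    float d = 0;
    for (int i = 0; i < kBins; ++i) d += std::fabs(a[i] - b[i]);
    return d;
}
```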

Sam
2

Using any form of hashing is really pointless here, since even nearly identical images will produce very different hash values. As was pointed out in the comments, two "duplicate" images can be slightly different (think, for example, of the effects caused by JPEG compression), so the interest is in detecting nearly duplicated images. Also, as was pointed out in the comments, considering only images of the same width is a first step towards reducing your quadratic number of comparisons; if all the images have the same width, there is no improvement.

The first problem you have to solve is discarding the bottom box of nearly identical images with different heights. Why is this box there? Is it a uniform background color? Preprocess your images to remove such bottom boxes; if that is problematic, explain why. From now on I will assume these boxes have been removed.

SSIM (Structural SIMilarity) may be a good approach for detecting your near-duplicates, but it is unlikely to be faster than a simpler algorithm such as the NRMSE described at Comparing image in url to image in filesystem in python. So, one way to speed up the process (although it remains quadratic in nature) is to first convert a given image to grayscale and only consider a small central window of it, say 50x50. Apply a Gaussian filter to this central window, so that minor structures (noise, for example) are mostly suppressed. Since you have quite a few images to compare against, I would apply a rough binarization in this smoothed central window: if a value v is greater than half of the maximum possible value, turn it into white, otherwise turn it into black. Now you have 2500 bits for each image. The next step could be the following: calculate the Hamming distance from these 2500 bits to a common bit pattern (2500 one bits would work here). Repeat this process for all your images, so each image ends up with a Hamming distance.
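A sketch of that pipeline with CImg might look as follows; the helper names are made up, and it assumes images of at least 50x50 pixels:

```cpp
#include "CImg.h"
#include <bitset>
#include <cstddef>
using namespace cimg_library;

// Reduce an image to 2500 bits: grayscale -> central 50x50 crop -> blur -> threshold.
std::bitset<2500> central_window_bits(const CImg<unsigned char>& src) {
    // Average the channels to get a grayscale image.
    CImg<float> gray(src.width(), src.height(), 1, 1, 0);
    cimg_forXY(src, x, y) {
        float sum = 0;
        for (int c = 0; c < src.spectrum(); ++c) sum += src(x, y, 0, c);
        gray(x, y) = sum / src.spectrum();
    }
    // Take a 50x50 window around the image center.
    const int cx = src.width() / 2, cy = src.height() / 2;
    CImg<float> win = gray.get_crop(cx - 25, cy - 25, cx + 24, cy + 24);
    win.blur(2.0f);                                  // suppress noise and fine structure
    const float threshold = win.max() / 2;           // half of the maximum value
    std::bitset<2500> bits;
    cimg_forXY(win, x, y) bits[y * 50 + x] = win(x, y) > threshold;
    return bits;
}

// Hamming distance to the all-ones pattern = number of zero bits.
std::size_t distance_to_ones(const std::bitset<2500>& bits) {
    return bits.size() - bits.count();
}
```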

Now let us find the nearly identical images. First, consider binning the computed Hamming distances into k distinct slots, so that only images falling in the same bin are considered further for comparison. This way, if image a lands in bin k_i and image b lands in bin k_j with i != j, we discard a as being similar to b. If too many images end up in the same bin, the process described above needs refinement and/or the interval of each bin needs to be reduced. To further speed up the process, consider first applying NRMSE between all the images in the same bin, and only those pairs that give a high value would, at last, be compared by SSIM.
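For instance, the binning step could be as simple as the following sketch, where the bin width is a tuning parameter that this answer does not prescribe:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Group image indices by which bin their Hamming distance falls into; only
// images sharing a bin remain candidates for the more expensive comparisons.
std::unordered_map<std::size_t, std::vector<std::size_t>>
bin_by_distance(const std::vector<std::size_t>& distances, std::size_t bin_width) {
    std::unordered_map<std::size_t, std::vector<std::size_t>> bins;
    for (std::size_t i = 0; i < distances.size(); ++i)
        bins[distances[i] / bin_width].push_back(i);  // bin_width must be > 0
    return bins;
}
```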

mmgp
  • I disagree that hashing is pointless. *Exact* hashing is pointless, but [Locality Sensitive Hashing](http://en.wikipedia.org/wiki/Locality-sensitive_hashing) is actually a great candidate for this sort of problem. – Yaniv Dec 17 '12 at 23:33
  • So we are in agreement that hashing is pointless. Now, for whatever other word I used, I'm sure one can find some `variation`-`blah` that does something different from the raw word `blah`. I would agree with your idea if you described in a meaningful way how `variation`-`hash` applies here. – mmgp Dec 17 '12 at 23:53
  • you asserted that any form of hashing is pointless; I merely pointed out that *some* forms of hashing are in fact very useful here. The Wikipedia page mentions image similarity identification as one of the applications. As for how it applies, consider your SSIM technique as described above, and [bit sampling](http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Bit_sampling_for_Hamming_distance) to generate a number of LSH hashes. Then you can use the number of matching hashes between images as a signal that they may be near-dupes. This may be faster than the binning mentioned above. – Yaniv Dec 18 '12 at 00:15
  • SSIM is only applied after everything else mentioned, so there is some contradiction in your suggestion. The binning method can be done in linear time, I really doubt you can fully apply LSH in (sub-)linear time starting from the initial images. Now, to me, it looks like the final part of your suggestion is to apply LSH on those 2500 bits I arrived at. If that is the case, it might be useful then, but it remains useless as the first step in the whole comparison scheme. So either I misunderstood all that you said, or it is only applicable after the method has arrived at a binary representation. – mmgp Dec 18 '12 at 00:21
  • I think we basically agree, the seeming disagreement is just a matter of different interpretations. – Yaniv Dec 18 '12 at 00:33
0

MarvinLabs already pointed out the idea of hashing the first N lines.

If you use a good hash (like MD5) over the first N (say 20) lines, you can be quite sure that hash collisions identify identical images. Put the hash, together with another unique image identifier (the file name?), into a std::multimap. Depending on path length, this multimap will cost you about 10 MB to 100 MB and can easily be kept in memory. You can do your reporting after the hashing. If you are paranoid, do another image comparison for the collisions; unless all images are from the same CCTV camera, the chance of a false positive is smaller than that of a read error from disk.
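A minimal sketch of that bookkeeping, assuming the 64-bit hash of the first N lines is computed elsewhere (MD5 truncated to 64 bits, or any other strong hash); file names and values are placeholders:

```cpp
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (hash of first N rows, file name) pairs, produced elsewhere.
    std::vector<std::pair<std::uint64_t, std::string>> hashed = {
        {0x1234, "a.jpg"}, {0x1234, "a_copy.jpg"}, {0xbeef, "b.jpg"},
    };

    std::multimap<std::uint64_t, std::string> by_hash(hashed.begin(), hashed.end());

    // Report groups of files sharing the same hash: duplicate candidates.
    for (auto it = by_hash.begin(); it != by_hash.end();) {
        auto range = by_hash.equal_range(it->first);
        if (std::distance(range.first, range.second) > 1) {
            std::cout << "possible duplicates:";
            for (auto p = range.first; p != range.second; ++p)
                std::cout << ' ' << p->second;
            std::cout << '\n';
        }
        it = range.second;
    }
}
```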

stefan