Using any form of exact hashing is really pointless here, since even nearly-identical images will give very different hash values. As was pointed out in the comments, two "duplicate" images can be slightly different (think, for example, of the effects caused by JPEG compression), so the interest is in detecting nearly-duplicated images. Also, as was pointed out in the comments, considering only images of the same width is a first step to reduce your quadratic number of comparisons; if all the images happen to have the same width, though, this gives no improvement. A simple grouping by width is sketched below.
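A minimal sketch of that grouping, assuming PIL/Pillow is available; the `images` directory is just a placeholder:

```python
import os
from collections import defaultdict
from PIL import Image

def group_by_width(image_paths):
    """Group image paths by width so only images of the same width are compared."""
    groups = defaultdict(list)
    for path in image_paths:
        with Image.open(path) as img:
            groups[img.size[0]].append(path)  # img.size is (width, height)
    return groups

# Example (placeholder directory name):
# groups = group_by_width(os.path.join("images", f) for f in os.listdir("images"))
```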
The first problem you have to solve is discarding the bottom box of nearly-identical images that differ in height. Why is this box there? Is it a uniform background color? Preprocess your images to remove such bottom boxes; if that is problematic to do, explain why. I will consider these boxes removed from now on. One way to strip a uniform-colored bottom box is sketched below.
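A rough sketch of that preprocessing, assuming the bottom box really is a (nearly) uniform color; the tolerance `tol` is an arbitrary placeholder:

```python
import numpy as np
from PIL import Image

def crop_uniform_bottom(path, tol=8):
    """Drop bottom rows whose pixels are (nearly) a single uniform color."""
    img = np.asarray(Image.open(path).convert("RGB"))
    bottom = img.shape[0]
    while bottom > 1:
        row = img[bottom - 1]
        # A row counts as "uniform" if every pixel is within `tol` of the row's first pixel.
        if np.abs(row.astype(int) - row[0].astype(int)).max() <= tol:
            bottom -= 1
        else:
            break
    return Image.fromarray(img[:bottom])
```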
The SSIM (Structural SIMilarity) index may be a good approach for detecting your near-duplicates, but it has no chance of being faster than a simpler measure such as the NRMSE described at Comparing image in url to image in filesystem in python. So, a way to possibly speed up the process (although it remains quadratic in nature) is to first convert a given image to grayscale and only consider a small central window from it, like 50x50. Apply a Gaussian filter on this central window, so minor structures (noise, for example) are mostly suppressed. Since you have quite a few images to compare against, I would apply a rough binarization to this smoothed central window: if a value v is greater than half of the maximum value possible, turn it into white, otherwise turn it into black. Now you have 2500 bits for each image. The next step could be the following: calculate the Hamming distance from these 2500 bits to a common bit pattern; 2500 bits set to 1 would work here. Repeat this process for all your images, so each image is reduced to a single Hamming distance. A sketch of this fingerprinting step is given below.
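A minimal sketch of the fingerprinting step, assuming NumPy and SciPy are available and that every image is at least 50x50 after preprocessing; the Gaussian sigma is an arbitrary placeholder:

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def fingerprint(path, win=50, sigma=2.0):
    """Grayscale -> central win x win window -> Gaussian blur -> binarize at half range."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    h, w = gray.shape
    top, left = (h - win) // 2, (w - win) // 2   # assumes h >= win and w >= win
    window = gray[top:top + win, left:left + win]
    smoothed = gaussian_filter(window, sigma=sigma)
    return (smoothed > 127.5).astype(np.uint8)   # half of the maximum value (255)

def hamming_to_all_ones(bits):
    """Hamming distance from the 2500-bit pattern to an all-ones pattern."""
    return int(bits.size - bits.sum())           # number of 0 bits
```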
Now let us find the nearly identical images. First, consider binning the Hamming distances found into k distinct slots, so only the images that fall in the same bin are considered further for comparison. This way, if an image a lands in bin k_i and an image b lands in bin k_j, with i != j, we discard a as a candidate duplicate of b. If too many images fall in the same bin, the process described above needs refinement and/or the interval for each bin needs to be reduced. To further speed up the process, consider first applying the NRMSE between all the images in the same bin, and only the pairs it scores as close matches would be, at last, compared by SSIM. A rough sketch of this binning and two-stage comparison follows.