
I have a bunch of images (from the M.C. Escher collection) that I want to organize. The first step I had in mind is to group them by comparing them (some have different resolutions, shapes, etc.).

I wrote a very crude script to:

  • read the files
  • compute their histograms
  • compare them

But the quality of the comparison is really low: files that are completely different end up matching.

Take a look at what I wrote so far:

Preparing the histograms

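import cv2

# `files` is the list of image paths collected earlier
files_hist = {}

for f in files:
    try:
        # Load the image and convert it to single-channel grayscale
        frame = cv2.imread(f)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Histogram over 4096 bins, then min-max normalized to [0, 1]
        hist = cv2.calcHist([frame], [0], None, [4096], [0, 4096])
        cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)

        files_hist[f] = hist
    except Exception as e:
        print('ERROR:', f, e)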

Comparing the histograms

import itertools

# Every unordered pair of files
pairs = list(itertools.combinations(files_hist.keys(), 2))

for f1, f2 in pairs:
    # Correlation of the two histograms; 1.0 means identical histograms
    correl = cv2.compareHist(files_hist[f1], files_hist[f2], cv2.HISTCMP_CORREL)

    if correl >= 0.999:
        print('MATCH:', correl, f1, f2)

Now, for example, I get a match for these two files:

m._c._escher_244_(1933).jpg

and

m._c._escher_208_(1931).jpg

Their correlation, using the code above, is 0.9996699595530539 (so they're practically the same :( )

What am I doing wrong? How can I improve that code to avoid these false matches?

Thanks!

Sandro Tosi
  • You can try using `cv::norm()` with the `NORM_L2` method to compare two images. – Bahramdun Adil Apr 06 '19 at 17:57
  • What is the purpose of your comparison? Finding similar images? Finding duplicates? Something else? – Eypros Apr 06 '19 at 18:52
  • the purpose is to find duplicates, yes that's the main purpose (i also have another project that compares videos, by grabbing frames at specific intervals and comparing them) – Sandro Tosi Apr 06 '19 at 21:59

1 Answer


Histograms are not a good way to compare images. In black-and-white images, for example, two images with the same number of black pixels have identical histograms, regardless of how those pixels are distributed across the image (that is why the images you mentioned are classified as almost equal).
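To make that concrete, here is a minimal sketch with two made-up 4x4 binary arrays (they are illustrative only, not the Escher files). Both contain eight black and eight white pixels, so their histograms are equal even though the pictures are clearly different:

```python
import numpy as np

# Two hypothetical 4x4 binary "images" with the same number of black (0)
# and white (255) pixels, arranged differently.
left_half = np.zeros((4, 4), dtype=np.uint8)
left_half[:, 2:] = 255            # black on the left, white on the right

checkerboard = np.zeros((4, 4), dtype=np.uint8)
checkerboard[::2, 1::2] = 255     # even rows: white on odd columns
checkerboard[1::2, ::2] = 255     # odd rows: white on even columns


def gray_histogram(image):
    """Count how many pixels fall on each of the 256 gray levels."""
    return np.bincount(image.ravel(), minlength=256)


same_histogram = np.array_equal(gray_histogram(left_half),
                                gray_histogram(checkerboard))
same_pixels = np.array_equal(left_half, checkerboard)
print(same_histogram, same_pixels)  # prints: True False
```

Any comparison based only on these histograms would report a perfect match for this pair.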

There are better ways to quantify the difference between images. This post mentions a good option:

  • Load both images as arrays (scipy.misc.imread) and calculate an element-wise (pixel-by-pixel) difference. Calculate the norm of the difference.
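A rough sketch of that idea, assuming both images are already loaded as NumPy arrays of the same shape (for example with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`; note that `scipy.misc.imread` has been removed from recent SciPy releases). The function name `difference_norm` and the choice of the L2 norm are my assumptions, the linked post does not fix either:

```python
import numpy as np


def difference_norm(image1, image2):
    """L2 norm of the pixel-by-pixel difference of two same-shaped images."""
    if image1.shape != image2.shape:
        raise ValueError("images must have the same shape")
    # Cast to float first: subtracting uint8 arrays would wrap around.
    diff = image1.astype(np.float64) - image2.astype(np.float64)
    return float(np.linalg.norm(diff.ravel()))


# Tiny made-up example: the arrays differ by 3 in one pixel and 4 in another.
a = np.zeros((2, 2), dtype=np.uint8)
b = a.copy()
b[0, 0] = 3
b[1, 1] = 4
print(difference_norm(a, a), difference_norm(a, b))  # prints: 0.0 5.0
```

The larger the returned value, the more the two images differ; identical arrays give exactly 0.0.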

edit:

Answering some questions:

I take the zero norm per-pixel is going to be 0.0-1.0 value, with values close to 0.0 meaning "images are the same", correct?

Values close to 0.0 mean the pixels are the same. To compare the images as a whole, you need to sum over all pixels. If the summed value is close to 0.0, the images are almost the same.

what if the 2 image sizes are different?

That's a good one. To calculate the norm of the difference, the images must have the same size. I see two ways to achieve that:

  • The first is to resize one of the images to the shape of the other; the problem is that this can distort the image.

  • The second is to pad the smaller image with zeros until the sizes match.
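The padding option could look something like the sketch below. It assumes 2-D grayscale arrays and pads at the bottom and right only; the helper name `pad_to_common_shape` is made up for illustration. Since neither image has to be smaller in both dimensions, it pads both up to the larger height and width. (For the resizing option, `cv2.resize` would be the usual tool.)

```python
import numpy as np


def pad_to_common_shape(image1, image2):
    """Zero-pad two 2-D arrays at the bottom/right so their shapes match."""
    height = max(image1.shape[0], image2.shape[0])
    width = max(image1.shape[1], image2.shape[1])

    def pad(image):
        extra_rows = height - image.shape[0]
        extra_cols = width - image.shape[1]
        return np.pad(image, ((0, extra_rows), (0, extra_cols)),
                      mode="constant", constant_values=0)

    return pad(image1), pad(image2)


# Placeholder arrays with the two shapes from the error message in this thread.
first = np.ones((850, 534), dtype=np.uint8)
second = np.ones((663, 650), dtype=np.uint8)
padded_first, padded_second = pad_to_common_shape(first, second)
print(padded_first.shape, padded_second.shape)  # prints: (850, 650) (850, 650)
```

Keep in mind that the padded zeros count as black pixels in the difference, so two copies of the same picture at different resolutions will still look different with this approach.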

Note: if you sum the pixel-wise norm, you get a value between zero and the number of pixels in the image. This can be confusing when you compare multiple images. For example, suppose you compare images A and B, both of shape 50x50 (so each has 2500 pixels): a value close to 2500 means the images are completely different. Now suppose you compare images C and D, both of shape 1000x1000: here a value like 2500 means the images are similar. To overcome this, divide the pixel-wise sum by the number of pixels in the image. The result is a value between 0.0 and 1.0, where 0.0 means the images are the same and 1.0 means they are completely different.
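As a sketch of that normalization, assuming 8-bit grayscale arrays of equal shape and taking the per-pixel value to be the absolute difference scaled by 255 (that per-pixel choice and the name `normalized_difference` are my assumptions):

```python
import numpy as np


def normalized_difference(image1, image2):
    """Mean per-pixel absolute difference of two 8-bit images, in [0.0, 1.0]."""
    # Per-pixel difference scaled to [0, 1]; the float cast avoids uint8 wrap-around.
    diff = np.abs(image1.astype(np.float64) - image2.astype(np.float64)) / 255.0
    # Dividing by the pixel count makes scores comparable across image sizes.
    return float(diff.sum() / diff.size)


small_black = np.zeros((50, 50), dtype=np.uint8)
small_white = np.full((50, 50), 255, dtype=np.uint8)
large_black = np.zeros((1000, 1000), dtype=np.uint8)
large_white = np.full((1000, 1000), 255, dtype=np.uint8)

print(normalized_difference(small_black, small_white))  # prints: 1.0
print(normalized_difference(large_black, large_white))  # prints: 1.0
print(normalized_difference(small_black, small_black))  # prints: 0.0
```

With this scaling, the 50x50 pair and the 1000x1000 pair get the same score for the same kind of difference, so a single threshold can be used across the whole collection.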

yeah here's the error i received when comparing 2 images with different size

diff = image1 - image2
ValueError: operands could not be broadcast together with shapes (850,534) (663,650)

This happens because the images have different shapes. Resizing or padding can avoid this error (as mentioned above).

  • thanks Heitor! that's a really interesting method, i'm just going ahead and try it out. Some questions: * I take the zero norm per-pixel is going to be 0.0-1.0 value, with values close to 0.0 meaning "images are the same", correct? * what if the 2 image sizes are different? the zero norm per-pixel will be slightly off, no (since it uses `img1.size`)? – Sandro Tosi Apr 06 '19 at 22:08
  • yeah here's the error i received when comparing 2 images with different size ``` diff = image1 - image2 ValueError: operands could not be broadcast together with shapes (850,534) (663,650) ``` – Sandro Tosi Apr 06 '19 at 22:44
  • i also used https://stackoverflow.com/a/49574931/1929629 and also the skimage functions as mentioned in the comments at https://stackoverflow.com/questions/189943/how-can-i-quantify-difference-between-two-images#comment92449874_49574931 but still no luck: either very few matches (even if the images are "visibly" the same) or too broad matches :( – Sandro Tosi Apr 07 '19 at 01:57