4

As an example, if I upload an image to imgur twice, and once to another website, there's a fair chance that all three copies will have different checksums. JPEG is lossy, so I can't simply check whether the pixels match.

How do I check whether two files contain the same picture, encoded differently? I don't want to write an algorithm myself; I want to use a library or an offline app via the CLI.

Additional information: I would prefer images to be considered different if they're cropped differently, but for my use case it won't matter (and I can simply check the width and height if I want that?)

Eric Stotch
  • I'm sure there are many possible answers to this; one example: apply a 2x2 blur to get rid of at least some of the artifacts, compute the difference of the two images, and take the average over all pixels; if it's below (5,5,5), the two images were probably the same? – Oct 31 '20 at 19:34
  • A good discussion is here: https://stackoverflow.com/questions/843972/image-comparison-fast-algorithm –  Oct 31 '20 at 19:35
  • 2
    The question seems to be too general - there is no clear definition when two images are considered to be the same image, and when the images are too different and considered to be two different images. You may compute the MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error), and define some threshold, when below it, the images are the same. Other similarity measurements are [PSNR](https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio) and [SSIM](https://en.wikipedia.org/wiki/Structural_similarity). I suggest you to edit your question and make it more focused. – Rotem Jun 05 '22 at 08:34
  • Can you add to your question some examples of the kind of identical images you're talking about? – Daniil Loban Jun 11 '22 at 11:58

3 Answers

2

There is an Imgur bot called "repoststatistics" that uses dHash to compare images.

How the bot works: https://www.hackerfactor.com/blog/index.php?/archives/2013/01/21.html

What library you can use to do the same:

https://github.com/benhoyt/dhash

https://github.com/Rayraegah/dhash

"I've found that dhash is great for detecting near duplicates, but because of the simplicity of the algorithm, it's not great at finding similar images or duplicate-but-cropped images -- you'd need a more sophisticated image fingerprint if you want that. However, the dhash is good for finding exact duplicates and near duplicates, for example, the same image with slightly altered lighting, a few pixels of cropping, or very light photoshopping."
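The linked blog post describes the core of dHash like this: shrink the image to a 9x8 grayscale thumbnail, then set one bit per pixel depending on whether it is brighter than its right-hand neighbour; near-duplicate images then differ by only a few bits (Hamming distance). A minimal pure-Python sketch of that bit-setting step, with the resizing omitted and a made-up grayscale matrix for illustration (real code should use one of the libraries above, which handle resizing via Pillow):

```python
# Minimal sketch of the dHash core: one bit per pixel, set when the pixel
# is brighter than its right-hand neighbour. `gray` is a plain nested list
# with height rows and width+1 columns (the extra column provides the last
# neighbour). Resizing/grayscaling, which real libraries do first, is omitted.

def dhash_bits(gray, width=8, height=8):
    bits = 0
    for row in gray[:height]:
        for x in range(width):
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    return bits

def hamming(a, b):
    # Number of differing bits between two hashes; small distance = near duplicate.
    return bin(a ^ b).count("1")
```

Comparing two hashes is then just `hamming(h1, h2)` against a small threshold (the blog post suggests that a distance below ~10 usually means a near duplicate).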

luzede
1

This task definition will necessarily involve some kind of heuristic. The same image can be represented in memory in a myriad of ways, so you're basically asking whether the images are similar from a human's perspective.

Step 1: Looking at the size of the image is a very good first step. It's easy, cheap, and actually serves as a good baseline before trying to compare the actual contents of the pictures.

Step 2: Now that we know the images are of the same size, we get to the hard part: how do we compare the contents?

There are a ton of different approaches here. The simplest are probably calculating the L2 norm or Mean Squared Error (MSE), or using the Structural Similarity Index (SSIM).

Furthermore, you can try to account for small color deviations by converting the image to grayscale first.
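As a toy illustration of the grayscale-then-MSE idea (the nested-list image representation and the BT.601 luma weights here are my own illustrative choices; real code would use NumPy/OpenCV arrays, as the full script below does):

```python
# Toy sketch: convert an RGB "image" (nested lists of (R, G, B) tuples) to
# grayscale, then compute the Mean Squared Error between two such images.
# Real code would vectorize this with NumPy; this just shows the arithmetic.

def to_gray(im):
    # ITU-R BT.601 luma weights
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in im]

def mse(a, b):
    # Mean of squared per-pixel differences; 0 means identical, larger = more different.
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

You would then declare the images "the same" when the MSE falls below a threshold you pick empirically.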

Here is a python script that compares sizes, converts to grayscale and uses SSIM to compare them with a controllable threshold:

#!/usr/bin/env python

from cv2 import imread, cvtColor, COLOR_BGR2GRAY
from skimage.metrics import structural_similarity

def get_size(im):
    h, w, d = im.shape  # OpenCV shape order is (height, width, channels)
    return (w, h)

def main(first, second, ssim_threshold):
    first_im = imread(first)
    first_sz = get_size(first_im)
    second_im = imread(second)
    second_sz = get_size(second_im)
    if first_sz != second_sz:
        print(f'Image sizes differ {first_sz} != {second_sz}')
        return False

    first_gs = cvtColor(first_im, COLOR_BGR2GRAY)
    second_gs = cvtColor(second_im, COLOR_BGR2GRAY)

    ssim = structural_similarity(first_gs, second_gs)
    if ssim < ssim_threshold:
        print(f'Image SSIM lower than allowed threshold [{ssim:.5f} < {ssim_threshold}]')
        return False

    print('Images are the same')
    return True

if __name__ == '__main__':
    import sys
    import argparse

    parser = argparse.ArgumentParser(description='Compare two images')
    parser.add_argument('first', type=str,
            help='First image for comparison')
    parser.add_argument('second', type=str,
            help='Second image for comparison')
    parser.add_argument('ssim', type=float, nargs='?', default=0.95,
            help='Structural similarity minimum (Range: (0, 1], 1 means identical)')

    args = parser.parse_args()
    same = main(args.first, args.second, args.ssim)
    sys.exit(0 if same else 1)

I used it on the following photos:

[three example photos: 1.jpeg (original), 2.jpeg (grayscale version), 3.jpeg (filtered version)]

Like this

>>> python compare.py 1.jpeg 2.jpeg 0.90
Image SSIM lower than allowed threshold [0.88254 < 0.9]
>>> python compare.py 1.jpeg 3.jpeg 0.90
Images are the same

Notes:

  • Notice that with a 0.9 threshold, the same image with a filter was identified as the same, but the grayscale one wasn't; you should probably find the right threshold for your use case
  • The script exits with 0 = equal and 1 = not-equal, so that you can automate it
  • It uses Python3 and the following packages:
opencv-python
scikit-image
Daniel Trugman
0

Yes, you can simply check the width and height if you want to consider pictures different if they're cropped differently. That could be done before (or, if different, instead of) the following procedure to compare picture A and picture B:

  • If A and B are of the same type (and, for lossy types, quality), diff them.
  • If A and B are of different lossless types, convert them to the same lossless type (by converting A to the type of B, B to the type of A, or A and B to a third lossless type), and diff them.
  • If A is of a lossless type and B is of a lossy type, convert A to the type of B of the same quality, and diff them.
  • If A and B are of different lossy types or qualities, whether the sources of A and B were the same is generally unknowable because some information was lost. In this case, the best you can do is decide whether A and B are similar enough by a method such as the ones described in the comments on your question; but, beware, however unlikely, different encodings of different image data may look the same.
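The "convert to a common lossless representation and diff" cases above can be sketched as follows, assuming the images are already decoded to raw pixel matrices (real code would decode the files with Pillow or OpenCV first; the helper names are illustrative):

```python
# Sketch of "decode to a common lossless representation, then diff".
# Assumes images are already decoded to nested lists of (R, G, B) tuples;
# decoding from PNG/BMP/etc. with Pillow or OpenCV is omitted.
import hashlib

def pixel_digest(pixels):
    # Hash the raw pixel bytes so two decoded images can be diffed cheaply.
    h = hashlib.sha256()
    for row in pixels:
        for px in row:
            h.update(bytes(px))
    return h.hexdigest()

def same_image(a, b):
    # Cheap size check first, then exact pixel comparison via the digest.
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return False
    return pixel_digest(a) == pixel_digest(b)
```

This only answers the lossless cases exactly; for the lossy-vs-lossy case you would fall back to a similarity measure like the ones above.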
greg-tumolo