
I am taking screenshots of an application and trying to detect whether the exact image has been seen before. Even trivial changes should count as different - e.g. if there is text in the image and the spelling changes, that counts as a mismatch.

I've been successfully using an MD5 hash of the contents of a screenshot image to look it up in a database of known images and detect whether it has been seen before.

Now I have ported it to another machine, and despite my attempts to match the configurations exactly, I am getting ever-so-slightly different images from those on the older machine. When I say different, the changes are minute - if I blow up the old and new images and flick between them, I can't see a single difference! Nonetheless, ImageMagick's compare command can see a smattering of pixels that are different.

So my MD5 hashes are no longer matching. Rather than a simple MD5 hash, I need an image hash.

Doing my research, I find that most image hashes try to be fairly generous - they accept resized, transformed and watermarked images, with a corresponding rate of false positive matches. I want an image hash that is far more strict - the only changes permitted are minute changes in colour.

Can anyone recommend an image hash library or algorithm? (Not an application, like dupdetector).

Remember: My requirements are different from the many similar questions in that I don't want a liberal algorithm like shrinking or pHash, and I don't want a comparison tool like structural similarity or ImageMagick's compare.

I want a hash that makes very similar images give the same hash value. Is that even possible?

Oddthinking
  • No, that's not possible. There would be no way to know what to discard. What is possible is to develop an image comparison tool that has a tunable threshold for how similar two images have to be. (To see why it's impossible, imagine trying to do a similar thing for, say, plays. To detect, for example, if someone had just changed a few words in a play. The number must either depend on each word or not. So you can't just compare the hashes for equality, you have to measure their distance.) – David Schwartz Apr 21 '12 at 12:18
  • Interesting. Comparing all of the hundreds or thousands of possible matches is infeasible. This is somewhat worrisome. Thank you. – Oddthinking Apr 21 '12 at 12:25
  • You don't have to compare all of the hundreds or thousands of possible matches. You only have to compare the ones that are generally similar. Ones that are completely different can't possibly match. – David Schwartz Apr 21 '12 at 12:25
  • What is the "liberal algorithm"? There is also dhash and idhash ([my own improvement of dhash](https://github.com/Nakilon/dhash-vips)) – Nakilon Jul 31 '18 at 03:29
  • You should be able to use imagehash for this: https://pypi.org/project/ImageHash – Robert Feb 26 '20 at 10:03
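
The thresholded comparison suggested in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: screenshots are available as raw, uncompressed channel bytes (e.g. RGB), and the names `ScreenshotIndex`, `nearly_equal`, `coarse_key` and the default `max_channel_delta=2` are hypothetical, not from any library. Candidates are first bucketed by a cheap key (here just the image dimensions) so that only generally similar images are compared pixel by pixel.

```python
from collections import defaultdict


def coarse_key(width, height):
    """Cheap bucket key: screenshots of different sizes can never match."""
    return (width, height)


def nearly_equal(a, b, max_channel_delta=2):
    """Return True if every channel value differs by at most max_channel_delta.

    a, b: bytes-like raw channel data with identical layout (assumed).
    """
    if len(a) != len(b):
        return False
    if a == b:  # fast path for byte-identical images
        return True
    return all(abs(x - y) <= max_channel_delta for x, y in zip(a, b))


class ScreenshotIndex:
    """Hypothetical in-memory store of known screenshots."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, name, width, height, pixels):
        self._buckets[coarse_key(width, height)].append((name, bytes(pixels)))

    def find(self, width, height, pixels):
        """Return the name of the first known image within tolerance, or None."""
        for name, known in self._buckets.get(coarse_key(width, height), []):
            if nearly_equal(known, pixels):
                return name
        return None
```

Because the tolerance applies to every pixel, a spelling change (which flips some pixels between text and background colour) should still register as a mismatch, while a uniform drift of a level or two would pass. The right threshold depends on how the two machines actually differ, so it would need tuning against real screenshots.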

1 Answer


Have a look at the paper "Spectral Hashing". It describes an algorithm designed to produce hash codes from images so that similar images are grouped together (see the retrieval examples at the end of the paper). It is a good starting point.

The link: http://www.cs.huji.ac.il/~yweiss/SpectralHashing/
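
Codes of this kind are usually compared by distance rather than by equality. The following is a minimal sketch of that lookup step only, assuming each image has already been reduced to a fixed-length binary code stored as a Python `int`; it does not implement spectral hashing itself, and `hamming_distance`, `within_radius` and the default `radius=2` are hypothetical names and values chosen for illustration.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")


def within_radius(query: int, known_codes, radius: int = 2):
    """Return the known codes whose Hamming distance to query is <= radius."""
    return [c for c in known_codes if hamming_distance(query, c) <= radius]
```

A radius of 0 reduces to an exact-match lookup; a small positive radius tolerates a few flipped bits, at the cost of scanning (or indexing) nearby codes.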

sansuiso