22

My users upload images to my website, and I would like to first offer them images that have already been uploaded. My idea is to (1) create some kind of image "hash" of every existing image, and (2) create a hash of each newly uploaded image and compare it with the others in the database.

I have found some interesting solutions, such as http://www.pureftpd.org/project/libpuzzle or http://phash.org/, but each has one or more problems:

  1. They need a nonstandard PHP extension (or are not written in PHP at all). That would be fine for me, but I want to ship this as a plugin for my popular CMS, which runs on many hosting environments that I do not control.
  2. They compare two images at a time, but I need to compare one image against many (e.g. thousands), and doing that one by one would be very inefficient and slow.

I would be happy to find only VERY similar images (e.g. a different size, a resaved JPG, or a different JPG compression factor).

The only idea I have is to resize the image to e.g. 5px*5px with 256 colors, create a string representation of it, and then look for the same string. But I guess that even two copies of the same image at different sizes may end up with tiny differences in color, so searching only for 100 % identical strings would be useless.

So I need a good format for that string representation of an image, one that could then be used with some SQL function to find similar images, or some other nice way. E.g. pHash creates perceptual hashes, so when two numbers are close, the images should be close as well, and I would just need to find the closest distances. But it is again an external library.

Is there any easy way?

Glorfindel
Tomáš Kapler
  • your idea was not that bad, and 256 colors won't give you "tiny differences". If so, lower that number. Another important issue: your image hash should be good enough to deal with small image rotations. – madfriend Jul 04 '12 at 18:03
  • An idea I just had about handling image rotations in the hash is to divide the hash into four equally sized parts and rotate the image so that the part with the lowest average value is on the bottom left. – Simon Forsberg Jul 05 '12 at 12:12
  • 1
    pHash doesn't "compare two images". It calculates a hash value for each image with the idea that similar images would have similar hashes. You can then use special data structures to store your image hashes and efficiently look for hashes (e.g. images) similar to the hash of the uploaded image. – jamix Sep 17 '13 at 09:16

4 Answers

24

I've had this exact same issue before.

Feel free to copy what I did, and hopefully it will help you / solve your problem.


How I solved it

My first idea, which failed and is similar to what you may be thinking, was to build a string for every single image (no matter what size). I quickly found that this fills the database very fast and was not effective.

The next option (which works) was a smaller image, like your 5px idea; I did exactly that, but with 10px*10px images. I created the 'hash' for each image with the imagecolorat() function.

See the imagecolorat() documentation on php.net.

After reading the RGB colours of each pixel, I rounded each value to the nearest 50 so that the colours were less specific. That number (50) is what you should change depending on how specific you want your searches to be.

For example:

// Pixel RGB
rgb(105, 126, 225) // Original
rgb(100, 150, 250) // After rounding numbers to nearest 50

After doing this for every pixel (10px*10px gives you 100 rgb() values), I put them into an array and stored it in the database using serialize() followed by base64_encode().
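The steps described so far could be sketched roughly as follows. This is only an illustration, not the answerer's actual code: it assumes the GD extension is available, and the function names `roundChannel`, `imageSignature`, and `encodeSignature` are made up for this example.

```php
<?php
// Round one colour channel (0-255) to the nearest multiple of $step.
function roundChannel(int $value, int $step = 50): int
{
    return (int) (round($value / $step) * $step);
}

// Shrink a GD image to $size x $size and return its rounded [r, g, b] triples.
// Hypothetical helper; requires the GD extension.
function imageSignature($image, int $size = 10, int $step = 50): array
{
    $thumb = imagecreatetruecolor($size, $size);
    imagecopyresampled(
        $thumb, $image,
        0, 0, 0, 0,
        $size, $size,
        imagesx($image), imagesy($image)
    );

    $signature = [];
    for ($y = 0; $y < $size; $y++) {
        for ($x = 0; $x < $size; $x++) {
            // For truecolor images imagecolorat() returns a packed 0xRRGGBB integer.
            $rgb = imagecolorat($thumb, $x, $y);
            $signature[] = [
                roundChannel(($rgb >> 16) & 0xFF, $step),
                roundChannel(($rgb >> 8) & 0xFF, $step),
                roundChannel($rgb & 0xFF, $step),
            ];
        }
    }
    return $signature;
}

// Turn a signature into a text value that fits in an ordinary database column.
function encodeSignature(array $signature): string
{
    return base64_encode(serialize($signature));
}

// Usage (the file name is a placeholder):
// $image = imagecreatefromstring(file_get_contents('upload.jpg'));
// $row   = encodeSignature(imageSignature($image));
```

Note that the resize here ignores the aspect ratio, so every image is squashed to a square before sampling; whether that is acceptable depends on how strict the match needs to be.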

When searching for similar images, I ran the exact same process on the image the user wanted to upload, then pulled the image 'hashes' from the database and compared them all to see which had matching rounded RGB values.
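One way the comparison step might look is sketched below. It assumes each stored value is an array of rounded [r, g, b] triples that was passed through serialize() and then base64_encode(); the names `signatureSimilarity` and `findSimilar` and the 0.9 threshold are illustrative choices, not something the answer specifies.

```php
<?php
// Fraction (0.0 to 1.0) of pixel positions at which two signatures agree.
function signatureSimilarity(array $a, array $b): float
{
    if (count($a) === 0 || count($a) !== count($b)) {
        throw new InvalidArgumentException('Signatures must be non-empty and of equal length.');
    }
    $a = array_values($a);
    $b = array_values($b);
    $matches = 0;
    foreach ($a as $i => $pixel) {
        if ($pixel === $b[$i]) {
            $matches++;
        }
    }
    return $matches / count($a);
}

// Return the ids of stored signatures that match $candidate on at least
// $threshold of their pixels. $stored maps id => encoded signature string.
function findSimilar(array $candidate, array $stored, float $threshold = 0.9): array
{
    $hits = [];
    foreach ($stored as $id => $encoded) {
        // allowed_classes => false stops unserialize() from instantiating objects.
        $signature = unserialize(base64_decode($encoded), ['allowed_classes' => false]);
        if (!is_array($signature) || count($signature) !== count($candidate)) {
            continue; // skip rows that are corrupt or use a different grid size
        }
        if (signatureSimilarity($candidate, $signature) >= $threshold) {
            $hits[] = $id;
        }
    }
    return $hits;
}
```

This still decodes every stored row, so it scales linearly with the number of images; narrowing the candidate rows in SQL first keeps the loop short.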


Tips

  • The bigger that 50 is in the RGB rounding, the less specific your search will be (and vice versa).

  • If you want your SQL to be more selective, it may help to store extra info about each image in the database so that you can limit the rows you pull back, e.g. if the aspect ratio is 4:3, only pull images around 4:3 from the database.

  • It can be difficult to get the thumbnail to exactly 5px*5px, so a suggestion is phpThumb. I used it with the syntax:

phpthumb.php?src=IMAGE_NAME_HERE.png&w=10&h=10&zc=1
// &w=  width of your image
// &h=  height of your image
// &zc= zoom control. 0:Keep aspect ratio, 1:Change to suit your width+height

Good luck mate, hope I could help.

jay
  • This is a good answer. Something to share with you and others, is the fact that when you round RGB values to the nearest 50, you are bound to get duplicate colours (I did for many photos). By using PHP's `array_unique()` function, this cleared out all the duplicates and left me with only 28 colours to store - a much lesser amount to worry about. – TheCarver Sep 10 '13 at 22:15
  • 5
    It's not correct that rounding each number of the RGB trio to the nearest 50 gives you the nearest colours, and therefore similar images. 124,76,76 and 76,76,124 (reddish and bluish hues) would both turn into 100,100,100 (grey). It would be better to convert RGB to an integer (from 0 to 16777215) and then round to hundreds or thousands. That would give you a better approximation of similar hues and colours. – FlamingMoe Feb 06 '14 at 07:33
  • I have a similar problem (want to identify duplicate photos before upload) but I don't see how what you are doing here is any different to phash (https://github.com/jenssegers/imagehash)? Why is this method advantageous? They both produce a string that has to be compared to find near duplicates. – TinyTiger Mar 07 '18 at 09:22
2

For an easy PHP implementation, check out: https://github.com/kennethrapp/phasher

However, I wonder if there is a native MySQL function for "compare" (see the PHP class above).
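Perceptual hashes are usually compared by Hamming distance, i.e. the number of bits that differ. The sketch below shows that comparison in plain PHP, assuming the hashes are stored as equal-length hexadecimal strings; the storage format and the name `hammingDistance` are assumptions for illustration and are not taken from the linked library.

```php
<?php
// Number of differing bits between two equal-length hexadecimal hash strings.
function hammingDistance(string $hexA, string $hexB): int
{
    if (strlen($hexA) !== strlen($hexB) || strlen($hexA) % 2 !== 0
        || !ctype_xdigit($hexA) || !ctype_xdigit($hexB)) {
        throw new InvalidArgumentException('Expected two hex strings of equal, even length.');
    }

    // XOR on two PHP strings works byte by byte; set bits mark the differences.
    $xor = hex2bin($hexA) ^ hex2bin($hexB);

    $distance = 0;
    for ($i = 0, $n = strlen($xor); $i < $n; $i++) {
        $distance += substr_count(decbin(ord($xor[$i])), '1');
    }
    return $distance;
}
```

On the MySQL side, if a 64-bit hash is stored in a BIGINT UNSIGNED column, the built-in BIT_COUNT() function combined with the ^ (XOR) operator computes the same distance, for example `SELECT id FROM images WHERE BIT_COUNT(phash ^ ?) <= 5`. The table name, column name, and threshold in that query are hypothetical.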

sebilasse
0

I scale the image down to 8x8, then convert RGB to 1-byte HSV, so the resulting hash is a 172-byte string.

HSVHSVHSVHSVHSVHSVHSVHSV... (from 8x8 block, 172 bytes long)
0fff0f3ffff4373f346fff00...

It's not 100% accurate (some duplicates aren't found), but it works nicely and there appear to be no false positives.
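The answer does not say how the HSV values are scaled or packed, so the following is only a guess at the general idea: each of H, S, and V is scaled to 0-255 and stored as one byte. The function names `rgbToHsvBytes` and `hsvHash` are hypothetical.

```php
<?php
// Convert an RGB pixel (0-255 per channel) to [h, s, v], each scaled to 0-255.
// Assumption: hue 0-360 degrees is mapped linearly onto 0-255.
function rgbToHsvBytes(int $r, int $g, int $b): array
{
    $rf = $r / 255;
    $gf = $g / 255;
    $bf = $b / 255;
    $max = max($rf, $gf, $bf);
    $min = min($rf, $gf, $bf);
    $delta = $max - $min;

    if ($delta == 0) {
        $h = 0.0; // grey: hue is undefined, use 0
    } elseif ($max == $rf) {
        $h = 60 * fmod(($gf - $bf) / $delta, 6);
    } elseif ($max == $gf) {
        $h = 60 * ((($bf - $rf) / $delta) + 2);
    } else {
        $h = 60 * ((($rf - $gf) / $delta) + 4);
    }
    if ($h < 0) {
        $h += 360;
    }
    $s = $max == 0 ? 0.0 : $delta / $max;

    return [
        (int) round($h / 360 * 255),
        (int) round($s * 255),
        (int) round($max * 255),
    ];
}

// Pack a list of [r, g, b] pixels into a binary string of H, S, V bytes.
function hsvHash(array $pixels): string
{
    $hash = '';
    foreach ($pixels as [$r, $g, $b]) {
        [$h, $s, $v] = rgbToHsvBytes($r, $g, $b);
        $hash .= chr($h) . chr($s) . chr($v);
    }
    return $hash;
}
```

With three bytes per pixel, an 8x8 grid packs to 192 bytes in this sketch; the answer reports 172 bytes, so its own packing evidently differs in some way that is not described.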

Peter
0

Put in academic terms, what you are looking for is a similarity function that takes two images and returns an indicator of how far apart or similar they are. This indicator could simply be a decimal number ranging from -1 to 1 (far apart to very close). Once you have this function, you can pick an image as a reference and compare all other images against it. Finding images similar to a given one is then as simple as finding the closest similarity values to its own, which is a simple search over a double field in an RDBMS like MySQL.

Now all that remains is to define the similarity function. To be honest, this is problem specific: it depends on what you call similar. Covariance is usually a good starting point; it just needs your two images to be the same size, which I think is no big deal. You can find many other ideas by searching for 'similarity measures between two images'.
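As one concrete possibility, covariance normalised by both standard deviations is the Pearson correlation coefficient, which conveniently falls in the -1 to 1 range mentioned above. The sketch below assumes both images have already been reduced to equal-length arrays of grey values (for example from same-sized thumbnails); the name `pearsonSimilarity` is made up for this example.

```php
<?php
// Pearson correlation between two equal-length lists of pixel intensities.
// Returns a value from -1.0 (inverted) through 0.0 (unrelated) to 1.0 (same pattern).
function pearsonSimilarity(array $a, array $b): float
{
    $n = count($a);
    if ($n === 0 || $n !== count($b)) {
        throw new InvalidArgumentException('Inputs must be non-empty and of equal length.');
    }
    $a = array_values($a);
    $b = array_values($b);

    $meanA = array_sum($a) / $n;
    $meanB = array_sum($b) / $n;

    $covariance = 0.0;
    $varianceA = 0.0;
    $varianceB = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $da = $a[$i] - $meanA;
        $db = $b[$i] - $meanB;
        $covariance += $da * $db;
        $varianceA += $da * $da;
        $varianceB += $db * $db;
    }

    // A completely flat image has zero variance, so correlation is undefined;
    // this sketch reports 0.0 ("no information") in that case.
    if ($varianceA == 0.0 || $varianceB == 0.0) {
        return 0.0;
    }
    return $covariance / sqrt($varianceA * $varianceB);
}
```

Because the means and spreads are divided out, a uniformly brighter or higher-contrast copy of an image still scores close to 1, which suits the resaved or recompressed duplicates the question is about.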

Mehran