Suppose there are 10,000 JPEG and PNG images in a gallery. How can I find all images whose color palettes are similar to that of a selected image, sorted by descending similarity?
-
Possible duplicate: http://stackoverflow.com/questions/593925/how-do-i-find-images-with-a-similar-color-using-python-and-pil – ChristopheD Nov 10 '09 at 00:08
-
Yeah, but there are no good answers on that question. :-) – Frank Krueger Nov 10 '09 at 00:35
-
There's a lot of similar discussion here: http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1048723#1048723 – Paul Nov 10 '09 at 00:40
-
Here is another contender: https://github.com/larytet-py/image-mathching (the code groups the matching colors and adds the percentage of the area occupied by each color group). – Larytet Sep 16 '19 at 13:17
1 Answer
Build a color histogram for each image. Then, when you want to match an image against the collection, simply order the list by how close each image's histogram is to your selected image's histogram.
The number of buckets depends on how accurate you want to be. The kind of data you combine to make a bucket (hue only, full RGB, and so on) defines what your search prioritizes.
For example, if you are most interested in hue, then you can define which bucket each individual pixel of the image goes into as:
    def bucket_from_pixel(r, g, b):
        hue = hue_from_rgb(r, g, b)  # hue in degrees, [0, 360)
        return int(hue * NUM_BUCKETS / 360)  # integer index in [0, NUM_BUCKETS)
If you also want a general matcher, then you can pick the bucket based upon the full RGB value.
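Both bucketing schemes can be sketched as follows. This is only an illustration under some assumptions: the answer does not define hue_from_rgb, so the version here is one possible implementation built on the standard library's colorsys module, and NUM_HUE_BUCKETS, BITS_PER_CHANNEL, hue_bucket, rgb_bucket and build_histogram are hypothetical names and values chosen for the example.

```python
import colorsys

NUM_HUE_BUCKETS = 36   # assumed value: 10 degrees of hue per bucket
BITS_PER_CHANNEL = 2   # assumed value: 4 levels per channel, 64 RGB buckets


def hue_from_rgb(r, g, b):
    """Return the hue of an 8-bit RGB pixel in degrees, in [0, 360)."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (h * 360.0) % 360.0


def hue_bucket(r, g, b):
    """Bucket index based on hue only (ignores saturation and brightness)."""
    index = int(hue_from_rgb(r, g, b) * NUM_HUE_BUCKETS / 360)
    return index % NUM_HUE_BUCKETS  # guard against rounding up to NUM_HUE_BUCKETS


def rgb_bucket(r, g, b):
    """Bucket index built from the top BITS_PER_CHANNEL bits of each channel."""
    shift = 8 - BITS_PER_CHANNEL
    return (((r >> shift) << (2 * BITS_PER_CHANNEL))
            | ((g >> shift) << BITS_PER_CHANNEL)
            | (b >> shift))


def build_histogram(pixels, bucket_fn, num_buckets):
    """Count how many (r, g, b) pixels fall into each bucket."""
    hist = [0] * num_buckets
    for r, g, b in pixels:
        hist[bucket_fn(r, g, b)] += 1
    return hist
```

With PIL, the pixels argument could presumably come from something like img.convert("RGB").getdata(); any iterable of (r, g, b) tuples works. Note that a hue-only histogram treats gray pixels as hue 0, so it may be worth skipping low-saturation pixels if that matters for your gallery.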
Using PIL, you can use the built-in histogram() function. The "closeness" of two histograms can be calculated using any distance measure you want. For example, an L1 distance could be:
    hist_sel = normalize(sel.histogram())
    hist = normalize(o.histogram())  # These normalized histograms should be stored
    # histogram() returns plain lists, so subtract element by element
    dist = sum(abs(a - b) for a, b in zip(hist_sel, hist))
An L2 distance would be:
    from math import sqrt
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(hist_sel, hist)))
The normalize function just scales the histogram so that its norm equals some constant value (1.0 works fine). This is important so that large images can be correctly compared to small images. If you're going to use L1 distances, then you should use an L1 norm in normalize (divide by the sum of the values). If L2, then use an L2 norm (divide by the square root of the sum of the squares).
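Putting the pieces together, a minimal sketch of the whole matching step might look like this. The answer does not show normalize, so the version below is an assumed implementation, and l1_distance, l2_distance, rank_by_similarity and the file names are hypothetical. The histograms are plain lists, which is the shape PIL's histogram() returns; short toy lists stand in for real images here.

```python
from math import sqrt


def normalize(hist, order=1):
    """Scale hist so its L1 (order=1) or L2 (order=2) norm equals 1.0."""
    if order == 1:
        norm = float(sum(abs(x) for x in hist))
    else:
        norm = sqrt(sum(x * x for x in hist))
    if norm == 0:
        return [0.0] * len(hist)  # empty histogram: nothing to scale
    return [x / norm for x in hist]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(selected, histograms, distance=l1_distance):
    """Return the names of all other images, most similar first."""
    target = histograms[selected]
    others = [name for name in histograms if name != selected]
    return sorted(others, key=lambda name: distance(histograms[name], target))


# Toy 4-bucket histograms standing in for Image.histogram() output.
raw = {
    "sel.png": [10, 0, 0, 10],
    "big_copy.png": [100, 0, 0, 100],  # same palette, ten times the pixels
    "close.png": [9, 1, 0, 10],
    "far.png": [0, 10, 10, 0],
}
normalized = {name: normalize(hist) for name, hist in raw.items()}
ranking = rank_by_similarity("sel.png", normalized)
```

Because of the normalization, big_copy.png ends up at distance 0 from sel.png even though it has ten times as many pixels, which is exactly the large-versus-small image case mentioned above.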

-
@Frank, thanks for your advice. Could you give me some example code in Python? PIL's built-in histogram() function returns a list; how do I determine how close two images' histograms are? – jack Nov 10 '09 at 00:18
-
@Frank, it looks like this requires 10,000 distance calculations when picking images with similar histograms out of 10,000 candidates. Is it possible to associate numeric values with each image and store them in a database, so that the comparison can be reduced to some SQL queries? – jack Nov 10 '09 at 01:55
-
@jack, 10,000 calcs isn't really that expensive. The best way to speed up code like this is not to reduce the histograms into integers (which can't be done the way you think) but to simply **cache** the results. Cache the sort order (per image) in the database or cache it in memory. Make sure you also store the histogram in the database or in memory so that rebuilding those sort order caches isn't expensive. – Frank Krueger Nov 10 '09 at 02:07
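The caching idea from the comment above could be sketched roughly like this. It is an in-memory illustration only: SimilarityIndex and its methods are hypothetical names, it assumes L1-normalized histograms and L1 distance, and a database-backed version would store the same two things (the histograms and the per-image sort order).

```python
class SimilarityIndex:
    """Keeps normalized histograms in memory and caches each image's ranking."""

    def __init__(self):
        self._hists = {}     # image id -> L1-normalized histogram
        self._rankings = {}  # image id -> cached list of other ids, closest first

    def add(self, image_id, raw_hist):
        total = float(sum(raw_hist)) or 1.0  # avoid dividing by zero
        self._hists[image_id] = [x / total for x in raw_hist]
        self._rankings.clear()  # every cached ordering may now be stale

    def similar_to(self, image_id):
        if image_id not in self._rankings:
            target = self._hists[image_id]

            def dist(other):
                return sum(abs(a - b) for a, b in zip(self._hists[other], target))

            self._rankings[image_id] = sorted(
                (i for i in self._hists if i != image_id), key=dist)
        return self._rankings[image_id]
```

Since the histograms stay stored, rebuilding a ranking after the gallery changes only costs the distance calculations, not a second pass over the image files.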