
Problem:
I've got around 10,000 images to compare to each other. My current program compares around 60 pairs every second, but at that speed it would take nearly 9 days of runtime to finish. I've tried C++, but the final code took nearly 3x as long as the Python one.

Question:
Is there any faster or more efficient way to compare images? I'm fine with using other languages and other libraries.

Code:

from PIL import Image
from PIL import ImageChops
import math
import os

def rmsdiff(image_1, image_2):
    # RMS of the pixel-wise difference: weight each histogram bin's count
    # by the squared pixel value ((i % 256) maps RGB bins back to 0-255).
    h = ImageChops.difference(image_1, image_2).histogram()
    sq_sum = sum(count * (i % 256) ** 2 for i, count in enumerate(h))
    return math.sqrt(sq_sum / (image_1.size[0] * image_1.size[1]))

current = 0
try:
    dire = "C:\\Users\\Nikola\\Downloads\\photos"
    photos = os.listdir(dire)
    start = 0  # default to the first file if no resume marker is found
    for idx, val in enumerate(photos):
        if val == "":  # placeholder for a filename to resume from
            start = idx
            break
    for photo_1 in range(start, len(photos)):
        if "." not in photos[photo_1]:
            continue  # skip entries with no file extension
        print(f'Image: {photos[photo_1]}')
        with Image.open(os.path.join(dire, photos[photo_1])) as image_1:
            image_1 = image_1.resize((16, 16))
            for photo_2 in range(photo_1 + 1, len(photos)):
                current = photos[photo_2]  # remembered for the Ctrl-C report
                if "." not in photos[photo_2]:
                    continue
                with Image.open(os.path.join(dire, photos[photo_2])) as image_2:
                    image_2 = image_2.resize((16, 16))
                    try:
                        value = rmsdiff(image_1, image_2)
                        if value < 12:
                            print(f'Similar Image: {photos[photo_2]}')
                    except Exception:  # a bare except would also swallow Ctrl-C
                        pass
except KeyboardInterrupt:
    print()
    print(current)
  • What are you comparing them with? Are you just checking if they are identical, or what is the heuristic that you use for the comparison? – Alexander Jan 01 '23 at 10:20
  • What are you actually trying to achieve? What do you know about the images - are they all the same size? Same format? Same subject? What do you know about the machine you are using - does it have multiple processors/cores? Fast NVMe disks? Lots of RAM? – Mark Setchell Jan 01 '23 at 10:29
  • @Alexander I'm comparing each image against every other image; the images could be resized or blurred, or there could be artifacts across them. They should count as similar when there are only small changes to the image. – nikola markoski Jan 01 '23 at 10:33
  • @MarkSetchell I'm trying to see if all the images are unique to a certain extent. The images are of differing sizes and formats, but most are of the same subject. I don't think my laptop has multiple processors; it does not have an NVMe disk, and it has 8 GB of RAM – nikola markoski Jan 01 '23 at 10:41
  • You seem to be comparing each image against all the ones following it in the directory listing. That means the last image will get opened and resized to 16x16 for a total of nearly 10,000 times. If you are only comparing 16x16 versions, you have enough RAM to hold all images resized to 16x16 at once, so you only open them and resize once. – Mark Setchell Jan 01 '23 at 10:48
  • You could maybe use `functools.lru_cache()` to decorate a function that opens and resizes to 16x16 if you didn't want to change your algorithm much... https://docs.python.org/3/library/functools.html – Mark Setchell Jan 01 '23 at 10:52
  • I'm not too bothered by changing my code. If need be I'll change languages. – nikola markoski Jan 01 '23 at 10:55
  • If your images are JPEGs, you should definitely use the *"shrink-on-load"* feature, a.k.a. *"draft"* mode in PIL, because you are resizing really small https://stackoverflow.com/a/57717879/2836621 – Mark Setchell Jan 01 '23 at 10:55
  • **What do you mean by "compare"**? What is the actual rule that needs to be implemented? If I have some images as input for this code, how can I even know what I should expect as the output? – Karl Knechtel Jan 01 '23 at 12:32
  • you're just asking about micro-optimizations, which is the wrong approach here. you need to reduce the complexity of the entire operation. https://en.wikipedia.org/wiki/Content-based_image_retrieval -- reduce the search space by not even considering pairs that are blatantly non-similar. use some type of "content" hashing. AI can help (cut off the classification layer, you get a feature layer). there's also "perceptual hashing" – Christoph Rackwitz Jan 02 '23 at 14:09
  • @ChristophRackwitz Yes, I am asking for optimisations to my code. I didn't say this in the original question because I didn't think it mattered, but I manually sorted around 40,000 images into 8 different categories, so I already reduced the runtime from ~5 months to 9-10 days – nikola markoski Jan 05 '23 at 19:04
  • You want to be done in less than an hour, basically the time it takes to read all those images? Because you can, if you don't insist on sticking to your approach – Christoph Rackwitz Jan 06 '23 at 18:48
  • @ChristophRackwitz I looked into what perceptual hashing was. I implemented it into my code, expecting it to be slower. The first few runs were slower, because I didn't calibrate it properly, but once calibrated the hashing was 10x faster and found more similar images. **Thank you** – nikola markoski Jan 07 '23 at 08:35
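
For reference, a minimal sketch of the perceptual-hashing approach the comments converge on, using the third-party imagehash package. The OP's actual implementation isn't shown, so the choice of phash and the distance threshold of 8 here are assumptions:

import os
from PIL import Image
import imagehash

dire = "C:\\Users\\Nikola\\Downloads\\photos"
hashes = {}
for name in os.listdir(dire):
    try:
        with Image.open(os.path.join(dire, name)) as im:
            hashes[name] = imagehash.phash(im)  # 64-bit perceptual hash
    except OSError:
        continue  # not an image file

names = list(hashes)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # Subtracting two hashes gives their Hamming distance;
        # the threshold needs calibrating for your data set.
        if hashes[a] - hashes[b] <= 8:
            print(f'Similar: {a} / {b}')

Each hash is computed once per image, so the per-pair work drops to a 64-bit comparison, which is why this runs so much faster than per-pair pixel differencing.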

2 Answers


Following on from my comments, I'm suggesting that loading and resizing take up most of the time, so that is where I'd aim to optimise.

I don't have a Python interpreter available at the moment to test properly, but along these lines:

from functools import lru_cache
from PIL import Image

@lru_cache(maxsize=None)
def loadImage(filename):
    # Each file is opened and resized only once; repeat calls hit the cache
    im = Image.open(filename)
    im = im.resize((16,16))
    return im

That should already make a massive difference. Then adjust to use "draft" mode, something like:

    im = Image.open(filename)
    im.draft('RGB', (32,32))   # JPEG shrink-on-load: decode at reduced size
    im = im.resize((16,16))
    return im

You could also multithread the loading if your laptop has a decent CPU.
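
A hedged sketch of that last idea: a thread pool tends to help here because Pillow releases the GIL for most of the decode work. The worker count and the `filenames` list are assumptions:

from concurrent.futures import ThreadPoolExecutor
from PIL import Image

def load_thumbnail(filename):
    with Image.open(filename) as im:
        im.draft('RGB', (32, 32))   # JPEG shrink-on-load, as above
        return im.resize((16, 16))  # resize forces the decode inside the with-block

with ThreadPoolExecutor(max_workers=8) as pool:
    thumbnails = list(pool.map(load_thumbnail, filenames))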

Mark Setchell

Your problem is quite strange, though: needing to read the pixel data itself in order to compare is unusual. Most of the time you would have some metadata to compare by instead.


That said, here are some very different approaches to speeding this up.

  1. You could parallelize the code for a speedup roughly proportional to your number of cores.
  2. You could change RMSE to a simple abs(diff) or another cheaper distance function, to save a lot of per-pair computation.
  3. You could write your own diff method that stops calculating once some difference threshold is passed. This requires the function to be compiled, or at least just-in-time compiled (see the second sketch after this list).
  4. If you pre-calculate some dimensionality reduction for each image, you can perform the comparison in the lower dimension first. For example, sum the rows to get a column of sums for each image, and compare that column instead of the entire image; then only run the full comparison on images whose lower-dimensional representations came out similar.
  5. If a lot of your images are the same, you could group them; then, when comparing against any image in a group, you won't have to do it again for all the other images in that group.
  6. A quick speedup can probably be obtained by correctly using cdist to calculate all-to-all distances (see the first sketch after this list).
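
For point 6, a minimal sketch using SciPy's cdist, assuming the 16x16 thumbnails are already loaded once up front (e.g. the `thumbnails` list from the other answer):

import numpy as np
from scipy.spatial.distance import cdist

# Flatten each 16x16 thumbnail into one row of a single matrix.
vectors = np.stack([np.asarray(t, dtype=np.float32).ravel() for t in thumbnails])

# All-to-all Euclidean distances in one vectorized call; dividing by
# sqrt(n_pixels) turns each entry into a per-pixel RMS like the question's.
dists = cdist(vectors, vectors) / np.sqrt(vectors.shape[1])

# Upper triangle only, so each pair is reported once; 12 is the question's threshold.
for i, j in zip(*np.nonzero(np.triu(dists < 12, k=1))):
    print(f'Similar pair: {i} <-> {j}')

For 10,000 images the distance matrix has 10^8 entries (roughly 800 MB as float64), so on an 8 GB laptop you would compute it in row blocks rather than all at once.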
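And for point 3, a sketch of an early-exit difference using Numba's JIT compiler (an assumed extra dependency; the function name is illustrative):

import numpy as np
from numba import njit

@njit
def sq_error_with_early_exit(a, b, threshold_sq):
    # Accumulate squared differences over two flat float arrays, bailing out
    # as soon as the running total exceeds the bound; -1.0 means "not similar".
    total = 0.0
    for k in range(a.size):
        d = a[k] - b[k]
        total += d * d
        if total > threshold_sq:
            return -1.0
    return total

# e.g. sq_error_with_early_exit(vectors[0], vectors[1], 12.0**2 * vectors.shape[1])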
Gulzar