How can I identify duplicate images that have different sizes?

Question

The problem is that I've got a folder with more than 80k images, and about 40% of them are duplicates. (Some of the pictures are rotated and some have a different size, but they are still the same image.)

At first I used a hashing algorithm (with C++/Java) to delete all the duplicate images (those that have the same size and other properties). But it seems it didn't delete all of them, because some pictures have a different size (but are visually identical).

I've searched a lot on the net to find an efficient algorithm for this problem.

The best code I found for my problem uses pHash, but it's outdated and doesn't work with VS anymore.

If someone has an idea for me, that would be awesome.

Thanks

This may help https://stackoverflow.com/a/25204466/2836621 – Mark Setchell Nov 01 '17 at 23:03

score 2 · Answer 1 · answered Nov 01 '17 at 23:06

In addition to the hashing algorithm, you could calculate the histogram of each image and then compare the histograms.

For rotated images the histogram should be exactly the same (at least for rotations by multiples of 90 degrees, which only rearrange the pixels); for resized images it should be very similar.

Here is an example of histogram comparison using OpenCV.
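Separately from that OpenCV example, the idea itself can be sketched in plain Python. This is only an illustration under stated assumptions: images are represented as nested lists of 8-bit grayscale values (real files would need a decoder first), the bin count of 32 is arbitrary, and all function names here are hypothetical. Histogram intersection is just one of several possible similarity measures.

```python
from typing import List

# Assumption: an image is a list of rows of 8-bit grayscale values (0..255).
Image = List[List[int]]


def normalized_histogram(img: Image, bins: int = 32) -> List[float]:
    """Count pixel intensities per bin and scale so the bins sum to 1.0."""
    counts = [0] * bins
    total = 0
    for row in img:
        for value in row:
            counts[value * bins // 256] += 1
            total += 1
    return [c / total for c in counts]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1]: 1.0 for identical histograms, 0.0 for no overlap."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rotate90(img: Image) -> Image:
    """Rotate clockwise by 90 degrees; this only rearranges pixels."""
    return [list(row) for row in zip(*img[::-1])]


def downscale2(img: Image) -> Image:
    """Halve width and height by averaging each 2x2 block (even sizes assumed)."""
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
            for x in range(0, len(img[0]) - 1, 2)
        ]
        for y in range(0, len(img) - 1, 2)
    ]
```

Comparing `normalized_histogram(img)` with `normalized_histogram(rotate90(img))` gives identical lists, because the rotation does not change any pixel value. Comparing against `downscale2(img)` usually gives a score below 1.0, since averaging creates new intensity values, so a threshold has to be chosen by experiment. Note that two unrelated images can also share a similar histogram, so a high score is a hint rather than proof.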

I still suggest using hashing first, because it should be much faster and will remove the first set of duplicates; then refine the result using histogram comparison.
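The two-stage approach could look roughly like the following sketch. It assumes the file contents are already in memory as bytes and that a normalized histogram has been computed for each remaining image; the function names, the choice of SHA-256, and the 0.95 threshold are illustrative assumptions, not values taken from any library.

```python
import hashlib
from itertools import combinations
from typing import Dict, List, Tuple


def exact_duplicate_groups(blobs: Dict[str, bytes]) -> List[List[str]]:
    """Stage 1: group file names whose raw bytes share a SHA-256 digest."""
    by_digest: Dict[str, List[str]] = {}
    for name, data in blobs.items():
        by_digest.setdefault(hashlib.sha256(data).hexdigest(), []).append(name)
    return [names for names in by_digest.values() if len(names) > 1]


def near_duplicate_pairs(
    histograms: Dict[str, List[float]], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Stage 2: report pairs whose histogram intersection reaches the threshold."""
    pairs = []
    for (a, ha), (b, hb) in combinations(sorted(histograms.items()), 2):
        score = sum(min(x, y) for x, y in zip(ha, hb))
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs
```

Stage 1 is linear in the number of files, while stage 2 as written compares every pair, which grows quadratically. For tens of thousands of images it would probably be worth running stage 2 only on what stage 1 leaves behind, and reviewing the reported pairs before deleting anything.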
