I have a bunch of images (from the M.C. Escher collection) that I want to organize. The first step I had in mind is to group them by comparing them (some have different resolutions, shapes, etc.).
I wrote a very crude script to:
* read the files
* compute their histograms
* compare them
But the quality of the comparison is really low: files that are completely different come out as matches.
Here is what I wrote so far:
Preparing the histograms
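import cv2

# Map each file path to its normalized grayscale histogram.
files_hist = {}
for f in files:
    try:
        frame = cv2.imread(f)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 4096 bins over the value range [0, 4096)
        hist = cv2.calcHist([frame], [0], None, [4096], [0, 4096])
        # Min-max scale the bin counts into [0, 1], in place.
        cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
        files_hist[f] = hist
    except Exception as e:
        print('ERROR:', f, e)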
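One thing that may be worth checking in this step is how many of the 4096 bins can ever be filled. `cv2.imread` with its default flag returns 8-bit data, so the grayscale values only span 0 to 255. The following is a minimal sketch of that check, not the script itself: it uses `np.histogram` as a stand-in for `cv2.calcHist`, a synthetic random image instead of a real file, and a hypothetical helper name `histogram_4096`.

```python
import numpy as np


def histogram_4096(gray):
    """Count pixel values into 4096 unit-wide bins over [0, 4096)."""
    hist, _ = np.histogram(gray.ravel(), bins=4096, range=(0, 4096))
    return hist


# Synthetic stand-in for an 8-bit grayscale image (hypothetical data).
rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

hist = histogram_4096(gray)
populated = int(np.count_nonzero(hist))

# An 8-bit image can populate at most 256 bins; the rest stay at zero.
print("populated bins:", populated, "of", hist.size)
print("pixels counted in bins 256..4095:", int(hist[256:].sum()))
```

If the real histograms look like this, every pair of files shares the same 3840 empty bins before any comparison takes place.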
Comparing the histograms
import itertools

# Compare every unordered pair of files once.
for f1, f2 in itertools.combinations(files_hist, 2):
    correl = cv2.compareHist(files_hist[f1], files_hist[f2], cv2.HISTCMP_CORREL)
    # Report near-perfect correlations as matches.
    if correl >= 0.999:
        print('MATCH:', correl, f1, f2)
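To get a feel for how the correlation score behaves, here is a minimal sketch that reimplements the Pearson correlation documented for `cv2.HISTCMP_CORREL` in plain NumPy and applies it to two unrelated synthetic histograms, once with 256 bins and once zero-padded to 4096 bins. The helper name `pearson` and the random data are assumptions made for illustration; they are not taken from the script above.

```python
import numpy as np


def pearson(h1, h2):
    """Pearson correlation computed across histogram bins."""
    d1 = h1 - h1.mean()
    d2 = h2 - h2.mean()
    return float((d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum()))


# Two unrelated synthetic 256-bin histograms (hypothetical data).
rng = np.random.default_rng(0)
a = rng.random(256)
b = rng.random(256)

# The same histograms with 3840 always-empty bins appended.
pad = np.zeros(4096 - 256)
a_padded = np.concatenate([a, pad])
b_padded = np.concatenate([b, pad])

score_256 = pearson(a, b)
score_4096 = pearson(a_padded, b_padded)

print("256 bins: ", score_256)
print("4096 bins:", score_4096)
```

In this sketch the padded score comes out much higher than the unpadded one, because the shared empty bins count as agreement. That suggests a fixed 0.999 threshold on the correlation may say less about image content than it seems to.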
Now, for example, I get a match for these two files:
m._c._escher_244_(1933).jpg
and
m._c._escher_208_(1931).jpg
and their correlation, using the code above, is 0.9996699595530539
(so they're practically the same :( )
What am I doing wrong? How can I improve the code to avoid these false matches?
Thanks!