2

I'm using Imagehash in Python to generate 48-digit hexadecimal hashes of around 30,000 images, which I'm storing in a list of dictionaries (the phashes as well as some other image properties). For example:

[{"name":"name1", "phash":"a12a5e81127d890a7c91897edc752b506657233f56c594b7e6575e24e457d465"},
 {"name":"name2", "phash":"a1aa7e011367812a7c9181be9975a9e86657239f3ec09697e6565a24e50bf477"}
 ...
 {"name":"name30000", "phash":"a1aa7e05136f810afc9181ba9951a9686617239f3ec4d497e6765a04e52bfc77"}]

I then have video input from a Raspberry Pi which is phashed, and that hash is compared to this database (given the nature of the Pi camera, the test hash from the video stream will NOT ever match the hashes in the database). Right now I'm doing a dumb loop, which takes about 5 seconds to loop through and check the Hamming distance of each of the ~30,000 pre-calculated hashes, which is way too slow. The Imagehash library I'm using means that the Hamming distance can be calculated simply by doing dbHash1 - testHash. Apparently sorting and doing bisect is not the way to approach this, since sorting is irrelevant to Hamming distances. So, I assume there must be a faster way to get this done? I've read this question regarding metric spaces, but I wanted to check if there's a (relatively) simple Python implementation that someone knows of.

Community
  • 1
  • 1
IronWaffleMan
  • 2,513
  • 5
  • 30
  • 59
  • Ah, I will clarify this in the OP; the test hash I'll be looking up will definitely not match any of the database hashes. I'm looking for the one with the smallest Hamming distance. – IronWaffleMan Sep 20 '16 at 03:18
  • There are a number of data structures that support searching a metric space, see this SO question: http://stackoverflow.com/questions/6389841/efficiently-find-binary-strings-with-low-hamming-distance-in-large-set – AChampion Sep 20 '16 at 03:40
  • I've seen that one, but I don't know how to implement any of those, and once implemented, how to search them. – IronWaffleMan Sep 20 '16 at 03:48

2 Answers2

2

I got an answer from the guy behind ImageHash, Johannes Buchner.

I can store the DB as a 2d matrix:

arr = []
for dbHash in db:
    arr.append(dbHash.hash.flatten())
arr = numpy.array(arr)

Then I can do the comparison against all at the same time:

binarydiff = arr != testhash.hash.reshape((1,-1))
hammingdiff = binarydiff.sum(axis=1)
closestdbHash_i = numpy.argmin(hammingdiff)
closestdbHash = db[closestdbHash_i]
IronWaffleMan
  • 2,513
  • 5
  • 30
  • 59
-1

Scipy's pairwise distance function supports Hamming distances. I'd try that.

Sohier Dane
  • 152
  • 1
  • 8
  • I already know how to get the Hamming distance from my test hash and all the hashes in the database. What I need is a quick way of doing it that doesn't involve 30,000 loop iterations. – IronWaffleMan Sep 20 '16 at 03:58
  • Scipy's implementation is written in C++ and likely to be fairly optimized. While still reliant on running direct pairwise comparisons, it should be meaningfully faster than something we would code up as a one off. I should have posted that note as a comment though, since it doesn't answer your main question about metric spaces. – Sohier Dane Sep 20 '16 at 17:13