0

I have a number of blockhash based fingerprints of images stored in a table along with the web site member who uploaded it and the local URL of the image:

member  varchar(8)  
fingerprint    char(64)
url     varchar(80) 

I am trying to do a Hamming Distance calculation on these hashes to determine how likely they are to be a match (reference mysql hamming distance between two phash).

Given the easiest way I have seen to do this is to use MySql's bit_count function to XOR the two and produce a total number of on bits, I know that I have to break the 64-character hash into 4 chunks, then convert each to unsigned integer before feeding it to bit_count. So, I have a query that does such (run from Linux command line, hence the argument variables):

select bit_count(cast(conv(substr('$1', 1, 16), 16, 10) as unsigned) ^ cast(conv(substr($2, 1, 16), 16, 10) as unsigned)) + 
bit_count(cast(conv(substr('$1', 17, 16), 16, 10) as unsigned) ^ cast(conv(substr('$2', 17, 16), 16, 10) as unsigned)) +
bit_count(cast(conv(substr('$1', 33, 16), 16, 10) as unsigned) ^ cast(conv(substr('$2', 33, 16), 16, 10) as unsigned)) +
bit_count(cast(conv(substr('$1', 49, 16), 16, 10) as unsigned) ^ cast(conv(substr('$2', 49, 16), 16, 10) as unsigned));

..And this produces the proper results between two fingerprints.

However, I need a query that will find matching fingerprints from anyone other than the member in question. Basically:

select member, url 
from images 
where (Hamming Distance between <fingerprint> and (select hashes from member)  < 10) 
    AND member != "<value>"

I think I might want to create a stored procedure to determine the Hamming Distance, then maybe limit the results I have to check from the entire database to something like those matching the first 10 characters. But is there a better way?

Community
  • 1
  • 1
Joe T
  • 11
  • 2

2 Answers2

1

A hamming_distance stored function is a good idea. Then you can use it in a join.

SELECT i1.member, i1.url
FROM images AS i1
JOIN images AS i2 ON i1.member != i2.member AND hamming_distance(i1.fingerprint, i2.fingerprint) < 10
WHERE i2.member = @member_in_question
Barmar
  • 741,623
  • 53
  • 500
  • 612
1

The function was the trick. It chunks the fingerprint and returns the distance between the two. Then it was a simple matter of a select:

select member, url, fingerprint, hamming_dist(fingerprint, '$fingerprint') as distance from images where hash REGEXP '$find' && hamming_dist(hash, '$hash') < 8 && member != '$member';"

The REGEXP just limited the search down to probable matches, it consisted of the first and last character in the fingerprint. Doing that dropped the query time down from .35 seconds to .12 seconds.

Thanks for the help!

Joe T
  • 11
  • 2