
I have a process, similar to TinEye's, that generates perceptual hashes; these are 32-bit ints.

I intend to store these in an SQL database (maybe a NoSQL DB) in the future.

However, I'm stumped as to how I would retrieve records based on the similarity of hashes.

Any ideas?

oPless
  • Probably going to need more information: are you considering the Hamming distance of the binary representations, or something else? – nickgrim Mar 16 '12 at 15:27
  • I'll be considering the hamming distance of the hash I have from the image, to the hashes stored in the database – oPless Apr 17 '12 at 17:19
  • What I meant was: unless I'm misunderstanding, Hamming distance is a property of a pair of *strings*, and you've got *ints*. How are you stringifying your ints - 32 1s and 0s, or some other way? – nickgrim Apr 17 '12 at 17:40
  • You're confusing the terminology of maths with the string type. An integer can be seen as a list of '0' and '1' elements. A 'string' type is just a list of bytes (let's not get into DBCS and Unicode). – oPless May 15 '12 at 00:56

3 Answers


A common approach (at least common to me) is to divide your hash bit string into several chunks and query on these chunks for exact matches. This is a "pre-filter" step. You then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.

So in simple terms: say you have a bunch of 32-bit hashes in a DB and you want to find every hash within a 3-bit Hamming distance of your "query" hash:

  1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4.

  2. Slice your query hash the same way into qslice 1 to 4.

  3. Query this table such that qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. By the pigeonhole principle, this returns every DB hash within 3 bits (4 slices − 1) of the query hash: a hash that differs in at most 3 bits cannot differ in all 4 slices. The result may also contain hashes that are much further away, which is why there is a step 4.

  4. For each returned hash, compute the exact Hamming distance pair-wise with your query hash (reconstructing the index-side hash from its four slices).

The number of operations in step 4 should be much smaller than a full pair-wise Hamming computation over your whole table.
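The four steps above can be sketched with Python's built-in sqlite3 module (the table and column names here are illustrative, not from the answer):

```python
import sqlite3

def slices(h):
    """Split a 32-bit hash into four 8-bit slices, most significant first."""
    return tuple((h >> shift) & 0xFF for shift in (24, 16, 8, 0))

def hamming(a, b):
    """Exact Hamming distance between two ints: XOR, then popcount."""
    return bin(a ^ b).count("1")

db = sqlite3.connect(":memory:")
# Step 1: one column per 8-bit slice, alongside the full hash.
db.execute("CREATE TABLE hashes (hash INTEGER, s1 INTEGER, s2 INTEGER,"
           " s3 INTEGER, s4 INTEGER)")
db.executemany("INSERT INTO hashes VALUES (?, ?, ?, ?, ?)",
               [(h, *slices(h)) for h in (0xDEADBEEF, 0xDEADBEEC, 0x12345678)])

def query(q, max_dist=3):
    s1, s2, s3, s4 = slices(q)               # step 2: slice the query hash
    candidates = db.execute(                 # step 3: exact-match pre-filter
        "SELECT hash FROM hashes WHERE s1=? OR s2=? OR s3=? OR s4=?",
        (s1, s2, s3, s4))
    # Step 4: exact pair-wise Hamming distance on the (small) candidate set.
    return [h for (h,) in candidates if hamming(h, q) <= max_dist]

print([hex(h) for h in query(0xDEADBEED)])  # ['0xdeadbeef', '0xdeadbeec']
```

In practice you would put an index on each slice column so the pre-filter queries are index lookups rather than scans.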

This approach was, AFAIK, first described by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:

  1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...] Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.
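A rough sketch of that permuted-sorted-orders idea (the fixed `window` scan below is a simplification of the paper's longest-matching-prefix search, and all names are mine, not the paper's):

```python
import random
from bisect import bisect_left

BITS = 32

def permute_bits(h, perm):
    """Rearrange the bits of h: bit perm[i] of h becomes bit i of the result."""
    return sum(((h >> src) & 1) << dst for dst, src in enumerate(perm))

def hamming(a, b):
    return bin(a ^ b).count("1")

def build_index(hashes, n_perms=4, seed=0):
    rng = random.Random(seed)
    perms = []
    for _ in range(n_perms):
        perm = list(range(BITS))
        rng.shuffle(perm)
        perms.append(perm)
    # One sorted order O_sigma per permutation: (permuted key, original hash).
    orders = [sorted((permute_bits(h, p), h) for h in hashes) for p in perms]
    return perms, orders

def nearest(q, perms, orders, window=2):
    """Approximate nearest neighbour: binary-search each O_sigma around q
    and examine a few elements on either side of the probe position."""
    best = None
    for perm, order in zip(perms, orders):
        pos = bisect_left(order, (permute_bits(q, perm), -1))
        for _, h in order[max(0, pos - window):pos + window]:
            if best is None or hamming(h, q) < hamming(best, q):
                best = h
    return best
```

The intuition: two hashes with a small Hamming distance agree on most bits, so under many random permutations they are likely to share a long permuted prefix and thus sort near each other in at least one O_σ.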

Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity at least 373, but might miss some for larger differences.

NB: C-similarity is the complement of the Hamming distance: the Hamming distance is the number of positions at which the corresponding bits differ, while C-similarity is the number of positions at which they agree, so for f-bit strings the two sum to f.
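In code, the two measures are one subtraction apart; a quick check on 4-bit values, plus Henzinger's numbers (384-bit strings, difference 11):

```python
def hamming(a, b):
    """Number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")

def c_similarity(a, b, f):
    """Number of bit positions (out of f) where a and b agree."""
    return f - hamming(a, b)

a, b = 0b1011, 0b1110
print(hamming(a, b), c_similarity(a, b, f=4))   # 2 2
# Henzinger's case: 384-bit strings, 11 differing bits -> C-similarity 373.
print(c_similarity(0, 0b11111111111, f=384))    # 373
```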

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

  1. THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint.)

[...] Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few of the 2^d bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in the d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d′ least-significant bit-positions).

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d′ bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.
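That single probe might look like this in Python (a sketch under my reading of the paper, with f = 32 and names of my own choosing):

```python
from bisect import bisect_left

F = 32  # fingerprint width in bits

def hamming(a, b):
    """Hamming distance: XOR, then popcount."""
    return bin(a ^ b).count("1")

def probe(sorted_fps, query, d_prime, k):
    """One probe of a sorted fingerprint table: find fingerprints that match
    `query` in the d' most significant bits, then keep only those that differ
    from it in at most k bits overall (the differences are necessarily
    confined to the f - d' least significant bits)."""
    shift = F - d_prime
    prefix = query >> shift
    # The table is sorted, so all fingerprints sharing the d'-bit prefix
    # form one contiguous range.
    lo = bisect_left(sorted_fps, prefix << shift)
    hi = bisect_left(sorted_fps, (prefix + 1) << shift)
    return [fp for fp in sorted_fps[lo:hi] if hamming(fp, query) <= k]

table = sorted([0xDEADBEEF, 0xDEAD0000, 0x12345678])
print([hex(fp) for fp in probe(table, 0xDEADBEEC, d_prime=16, k=3)])
# ['0xdeadbeef']
```

The additional sorted tables the paper mentions are copies of the fingerprints with their bits rotated or permuted, so that every possible position of the k differing bits is pushed into the low bits of some table.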

PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.

Philippe Ombredanne
    This is really a good discussion on the problem, I'm going to have to reassign the accepted answer to this one from @David Manheim's original answer – oPless Dec 08 '17 at 17:45
  • @oPless Thanks! the approach I describe here is considered as state of the art AFAIK, except possibly for http://www.cse.unsw.edu.au/~weiw/files/SSDBM13-HmSearch-Final.pdf that can beat it in some cases – Philippe Ombredanne Dec 08 '17 at 19:19
  • 2
    @PhilippeOmbredanne Your link is 404. Use the archive link instead: https://web.archive.org/web/20170309155017/http://www.cse.unsw.edu.au/~weiw/files/SSDBM13-HmSearch-Final.pdf – luckydonald Oct 01 '19 at 22:12
  • 2
    I think this should not be the accepted answer. Probably the papers/patents are OK but the solution won't work. Now let me show how it yields a hash with *24 bits* of Hamming distance: StoredHash=[0xFF],[0xFF],[0xFF],[0xFF], QueryHash= [0xFF],[0x00],[0x00], [0x00]. The condition is satisfied by q_slice1=i_slice1, hence we get the hash with 24 bits of distance selected. – A. Genchev Jun 03 '20 at 20:42
  • @A.Genchev This is a "pre-filtering" step, so in your case it will yield things that can be up to 24 bits of hamming distance with a guarantee that it will catch anything within 3 bits or less of hamming distance. Once you get that subset, you do a pair-wise actual hamming distance comparison to get the real values. Just using the pre-filter step alone will not be enough. – Philippe Ombredanne Jun 06 '20 at 08:06
  • @Philippe Ombredanne I don't understand you. I have shown that it catches everything within *24* bits of hamming distance and you tell me that within 3 bits ? I think you have mistaken bits with bytes. Each byte has 8-bits. 3 bits means *less* than *half* byte difference and I have shown that this criteria allows *3* bytes difference, which is huge! – A. Genchev Jun 06 '20 at 11:07
  • @A.Genchev If you have a hash of 32 bits that you break into 4 chunks of 8 bits, then the pre-filtering on 8-bit chunks will guarantee that you find at least all the strings within a 3-bit Hamming distance. You will also get things that match less, and more, as you said, which is why you still want to perform a pair-wise Hamming computation on that pre-filtered set. The 3 comes from 4 pieces − 1. And I really mean 3 bits Hamming distance, not 3 bytes. – Philippe Ombredanne Jul 27 '20 at 21:14
  • If this is not clear to you I can dig and update the answer with details from a private email from Monika Henzinger that provided me the outline of a proof (and that was a long long time ago). There is nothing self-evident in there. – Philippe Ombredanne Jul 27 '20 at 21:16
  • If Henzinger talked about bits, she's deeply wrong, because 12 non-overlapping 4-byte pieces make 384 bits, and having only one in common makes the largest possible Hamming distance 352 in her case. But she talks about C-similarity and NOT Hamming distance. – A. Genchev Sep 11 '20 at 20:40
  • @A.Genchev you wrote > But she talks about C-similarity and NOT Hamming distance. These are essentially the same AFAIK. The Hamming distance is the number of positions at which the corresponding bits differ while C-similarity is the number of positions at which the corresponding bits agree. – Philippe Ombredanne Sep 14 '20 at 14:04
  • 3
    I went ahead and implemented this - it works *extremely* well. I added around 3000 photos to an index using a 64-bit [phash](https://www.npmjs.com/package/sharp-phash), using four 16-bit `INT` columns with a [covering index](https://www.sqlite.org/optoverview.html#covering_indexes) in SQLite. ~98% of the returned results have the desired distance, less than 2% need further filtering. A duplicate search among all files takes ~2 seconds on my 7 years old system. This technique is crazy fast! – mindplay.dk Jan 02 '22 at 11:59
  • 2
    I have been thinking about this extensively and reading https://arxiv.org/pdf/1307.2982.pdf, a key reference. The confusion here is the wording used by Philippe Ombredanne, which is incorrect. To address @A.Genchev point, the correct wording is that it guarantees that any result within a hamming distance of 3 is included but it does not guarantee that all results have a hamming distance of 3 or fewer. As correctly demonstrated, 24 > 3 but no results <=3 will be missed – Justin Winokur Jan 24 '23 at 20:12
  • @JustinWinokur you are correct! – Philippe Ombredanne Jan 26 '23 at 08:22
  • @JustinWinokur you get it: to perform similarity search you need the results with hamming distance *less* than given number of bits, say 4. So if the accepted solution allows for error as big as 24bits, it does not represent a solution - the problem remains unsolved. That was my point. Using SQL you want to *filter out* all unwanted records, not to include them in the result. – A. Genchev Feb 12 '23 at 16:30
  • So this works great, but statistically with 12m rows and 4 unsigned tinyints from a 32 bit simhash, 4/255 of those will be returned at least, which is 188k rows, still far too many to rapidly do a full hamming distance on.... – Kevin Jul 13 '23 at 19:51

David's discussion is correct, but if you don't have a lot of data, check out Hamming distance on binary strings in SQL.

riothamus

To find the Hamming distance between two hashes, XOR the integers and count the set bits in the result (a popcount); the same comparison can be expressed with & and ~ on the integers.
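MySQL can express this directly as `BIT_COUNT(a ^ b)`; SQLite has no built-in popcount, but you can register one as a user-defined function. A sketch with Python's sqlite3 (the full-scan cost discussed below still applies):

```python
import sqlite3

def hamming(a, b):
    """Hamming distance: XOR, then count set bits."""
    return bin(a ^ b).count("1")

db = sqlite3.connect(":memory:")
db.create_function("hamming", 2, hamming)  # expose the popcount to SQL
db.execute("CREATE TABLE hashes (hash INTEGER)")
db.executemany("INSERT INTO hashes VALUES (?)",
               [(0xDEADBEEF,), (0xDEADBEEC,), (0x12345678,)])

# Every row is compared against the query hash -- a full table scan.
rows = db.execute("SELECT hash FROM hashes WHERE hamming(hash, ?) <= 3"
                  " ORDER BY hash", (0xDEADBEED,)).fetchall()
print([hex(h) for h, in rows])  # ['0xdeadbeec', '0xdeadbeef']
```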

SQL isn't made for this sort of processing. The comparisons on large data sets get very messy, and will not have the speed of a query that plays to the strengths of the system. That said, I've done similar things.

This gives you individual differences, which would need to be computed across the full data set and ordered, which is messy at best. If you want it to run faster, you will need strategies like indexing by "region," or finding natural groupings in your data. There are umbrella clustering strategies and similar approaches - there is a lot of literature. It will, however, be messy in most traditional database systems.

David Manheim
  • 1
    As you can see from the related questions and answers, the suggestions are almost all not using databases. – David Manheim Jun 22 '12 at 15:24
  • Yes, I wonder how the likes of tineye and google do their searches (assuming they use perceptual hashes). – oPless Aug 01 '12 at 16:15
  • 1
    @DavidManheim IMHO there is nothing in SQL that prohibits this sort of processing. See my answer for a practical approach that works in and outside a DB. – Philippe Ombredanne Nov 27 '17 at 08:19
  • @PhilippeOmbredanne I agree that it's not prohibited, and the really clever hack below notwithstanding, it's just not a great language for this. If you need to do it, you can. (And god knows I've done things in SQL horrifically ill-suited to the language because of system and regulatory constraints on taking data out of the system!) – David Manheim Nov 28 '17 at 17:09
  • 2
    FWIW, this "clever hack" as you call is considered by many as the state of the art for bounded hamming distance queries ;) .... and I agree SQL is not a panacea there (and this hack works equally if not better with sorted files and binary search) – Philippe Ombredanne Nov 28 '17 at 19:19
  • 1
    The clever hack is implementing it in SQL, not the method itself! – David Manheim Nov 30 '17 at 03:16
  • 1
    @David Manheim - I'm not sure who downvoted you, however Phillipe's answer has a deeper discussion on the issues I had. – oPless Dec 08 '17 at 17:47
  • 1
    @oPless - Absolutely agree! – David Manheim Jan 10 '18 at 15:06
  • Note that this answer explains one way to compute pair-wise hamming distance, which is needed in step 4. of my approach. – Philippe Ombredanne Jan 26 '23 at 08:34