1

The key idea of Locality sensitive hashing (LSH) is that neighbor points, v are more likely mapped to the same bucket but points far from each other are more likely mapped to different buckets. In using Random projection, if the the database contains N samples each of higher dimension d, then theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension denoted as g(**v**) = (h_1(v),h_2(v),...,h_k(v)). So, for any vector point v, the point is mapped to a k-dimensional vector with a g-function. Then the hash code is the vector of reduced length /dimension k and is regarded as a bucket. Now, to increase probability of collision, theory says that we should have L such g-functions g_1, g_2,...,g_L at random. This is the part that I do not understand.

Question : How to create multiple hash tables? How many buckets are contained in a hash table?

I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et. al Link to Code

In the file Coding.m

dim = size(X_train, 2);
R = randn(dim, bit);

% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >=0);   

X_query is the set of query data each of dimension d and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The output of B_query and B_base are N strings of length k taking 0/1 values. Does this way create multiple hash tables i.e. N is the number of hash tables? I am confused as to how. A detailed explanation will be very helpful.

halfer
  • 19,824
  • 17
  • 99
  • 186
SKM
  • 959
  • 2
  • 19
  • 45
  • I apologise for the downvote, but this is the third question of yours I've found that has had substantial edits made to it after an answer is received. As I said on another answer, please broadly avoid doing this at all, but if you need to on rare occasions, please add a justification in the edit remarks. – halfer Jun 19 '16 at 22:12

1 Answers1

1

How to create multiple hash tables?

LSH creates hash-table using (amplified) hash functions by concatenation:

g(p) = [h1(p), h2(p), · · · , hk (p)], hiR H

g() is a hash function and it corresponds to one hashtable. So we map the data, via g() to that hashtable and with probability, the close ones will fall into the same bucket and the non-close ones will fall into different buckets.

We do that L times, thus we create L hashtables. Note that every g() is/should most likely to be different that the other g() hash functions.

Note: Large k ⇒ larger gap between P1, P2. Small P1 ⇒ larer L so as to find neighbors. A practical choice is L = 5 (or 6). P1 and P2 are defined in the image below:

enter image description here

How many buckets are contained in a hash table?

Wish I knew! That's a difficult question, how about sqrt(N) where N is the number of points in the dataset. Check this: Number of buckets in LSH

The code of Yan Xia

I am not familiar with that, but from what you said, I believe that the query data you see are 1000 in number, because we wish to pose 1000 queries.

k is the length of the strings, because we have to hash the query to see in which bucket of a hashtable it will be mapped. The points inside that bucket are potential (approximate) Nearest Neighbors.

Daniel Node.js
  • 6,734
  • 9
  • 35
  • 57
gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • Thanks once more. THe part that is unclear is the number of hash codes, L which become the number ofhash tables. I seem to be missing on what is L. If 2 symbols are used 0 and 1 then how does one choose L? – SKM May 26 '16 at 17:10
  • In the code by Yan Xia, I do not see how they create L hash tables. THe Querydata are 1000 samples of 960 dimension. So, how is this connected to L? – SKM May 26 '16 at 17:12
  • The number of hash codes is `k`. `L` is the number of the hash functions `g()`. As mentioned in my answer @SKM, L is usually 5 or 6, it's not a function of `k`, is that is what you mean. One query is related to `L`, only in the fact that it hashed `L` times, one for each `g()` hash function. – gsamaras May 26 '16 at 17:19
  • 1
    Can you please mention the reason why L = 5 or 6? Based on my understanding from your explanation, if Number of Hash tables = Number of g() functions, then if k =2 is the reduced dimensionality,then I should have the following hash codes g1 = (01); g2 = (00); g3 = (10);g4=(11). i.e., 2^k combinations, each combination is a hash table. Please correct me if wrong. – SKM May 26 '16 at 17:45
  • 1
    @SKM `L` can be more (or even less) than 5 or 6. However, **in practice** it is said that LSH works well with `L` equal to 5 or 6. I do not know if `k` is called "reduced dimensionality". However, in your example, you are correct, i.e. for `L` = 4, we would have 4 hashtables, each one with the `g()` functions you mentioned. – gsamaras May 26 '16 at 17:51
  • One last Question, will be greatful for your insights. If there are N data points, then will there be L hash tables / hash codes of string length k for each data point? So, in total will there be N*L hash tables? From the example in my commen it seems that longer the hash code, more are the hash tables since L = 2^k where k is the length of the hash code. – SKM May 26 '16 at 18:28
  • @SKM **no**. The hash tables are `L`. So the number of hashtables is *not* a function of `k` or `N`. `k` modifies the length of the hash function `g()`. Hope this helps. :) – gsamaras May 26 '16 at 18:36
  • If possible, could you please provide a code snippet in Matlab showing how to create L hash tables for LSH using random projection ? This will help to get a better picture of your explanation in your answer. In hte Question that I posted, the output of B_query is 1000 points each of length bit representing 1000 hash codes. – SKM May 26 '16 at 20:05
  • 1
    I haven't wrote Matlab for a long @SKM, sorry. Nevertheless, that's another question, so you might want to post a new question! :) – gsamaras May 26 '16 at 21:56