Referring to the answer on another question, I am looking for clarification of how the LSH process works. Suppose I have sparse feature vectors (binary, mostly 0) and would like to use cosine distance as the measure, with a threshold alpha that might vary.
My first step is to compute the hash for each of the vectors. Does the distance measure matter? (I suppose yes.) Does the threshold matter? (I suppose no.) How can I find an appropriate hash function?
If programming, I would have a function like:
byte[] getHash(Vector featureVec)
Then I would put the results in a map:
Map(long vectorId, byte[] hashcode) <- vectorHashMap
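To make this step concrete, here is a minimal Python sketch of what I imagine it could look like. It assumes random hyperplane hashing (sign random projections), which as far as I understand is the usual LSH family for cosine similarity; whether that is the right choice is part of my question. The names make_hyperplanes and get_hash, the 16-bit signature length, and the toy vectors are my own placeholders, and each sparse binary vector is represented by the list of its non-zero indices.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def get_hash(active_indices, hyperplanes):
    """Return one bit per hyperplane: 1 if the vector lies on its non-negative side.

    The vector is binary, so its dot product with a hyperplane is just the
    sum of the hyperplane's coordinates at the non-zero positions.
    """
    bits = []
    for plane in hyperplanes:
        dot = sum(plane[i] for i in active_indices)
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


# Toy data (hypothetical): vectorId -> indices of the non-zero features.
vectors = {1: [0, 3, 7], 2: [0, 3, 7], 3: [1, 2, 9]}
hyperplanes = make_hyperplanes(num_bits=16, dim=10)

# Plays the role of vectorHashMap: vectorId -> hashcode.
vector_hash_map = {vid: get_hash(idx, hyperplanes) for vid, idx in vectors.items()}
```

In this sketch the threshold alpha does not appear at all, which matches my guess that it does not matter when computing the hashes, only the distance measure does.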
Then I build a hash table from the hashes (putting the hashes into bins). I suppose the threshold should matter at least here. How can I do that?
If programming, it would be like:
Map,Map createHashTable(Map vectorHashMap, long threshold)
which returns two maps:
Map of (hashCode, bucketId)
and a Map of (bucketId, ListOfVectorIds).
Then I could easily retrieve the neighbors, with a vectorId as input and a list of vectorIds as output.
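Here is a rough Python sketch of how I picture the bucketing and lookup, assuming the common banding scheme: the bit signature is split into bands, and two vectors become candidates if they agree on every bit of at least one band. The names create_hash_table, get_neighbors and rows_per_band are hypothetical, and the (band index, band bits) key stands in for my bucketId. In this sketch the threshold does not enter as a direct argument; my understanding, which may be wrong, is that it only influences how many bits go into each band and how many bands there are.

```python
from collections import defaultdict


def band_keys(bits, rows_per_band):
    """Split a bit signature into consecutive bands and yield one key per band."""
    for start in range(0, len(bits), rows_per_band):
        yield (start // rows_per_band, tuple(bits[start:start + rows_per_band]))


def create_hash_table(vector_hash_map, rows_per_band):
    """Map each band key (acting as a bucket id) to the vector ids that share it."""
    buckets = defaultdict(list)
    for vector_id, bits in vector_hash_map.items():
        for key in band_keys(bits, rows_per_band):
            buckets[key].append(vector_id)
    return buckets


def get_neighbors(vector_id, vector_hash_map, buckets, rows_per_band):
    """Return the ids that share at least one bucket with vector_id."""
    candidates = set()
    for key in band_keys(vector_hash_map[vector_id], rows_per_band):
        candidates.update(buckets.get(key, ()))
    candidates.discard(vector_id)
    return sorted(candidates)


# Toy 4-bit signatures (hypothetical), split into 2 bands of 2 bits each.
toy_hashes = {1: (1, 0, 1, 1), 2: (1, 0, 0, 0), 3: (0, 1, 0, 1)}
toy_buckets = create_hash_table(toy_hashes, rows_per_band=2)
neighbors_of_1 = get_neighbors(1, toy_hashes, toy_buckets, rows_per_band=2)
```

The returned ids are only candidates: I assume I would still compute the exact cosine distance for each of them and keep those that pass the threshold alpha.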