2

I'm trying to create a forst for nearest neighbor searching but I'm not sure I'm doing it right or even if MinHash / LSH is an appropriate choice for my data. I ask this because the results are not usable.

I'm trying to follow the example in the documentation.

My Data:

512 dimensions, eg value is a bit eg 0 or 1 Is this actually usable for MinHash / LSH? And if yes, how would I construct the MinHash for each record?

As far as i understood the point of minhash is already to map the data to such a bit-structure? So i could just load the bits into it? As in h = MinHash(num_perm=512, hashvalues=listOfBits) ?

beginner_
  • 7,230
  • 18
  • 70
  • 127

1 Answers1

1

MinHash is a technique that can be used if individual data records can be described as sets (e.g. text document as set of words) and the similarity between such records is described by the Jaccard similarity of the corresponding sets.

If you really want to apply MinHash you need to first find a way to represent your bit vector of size 512 as set. A possibility would be to consider the set of bit indices with value 1. Next, you need to think if the Jaccard similarity between such sets of bit indices really makes sense and describes the similarity appropriately.

otmar
  • 386
  • 1
  • 9