
I'm trying to use .approxSimilarityJoin from Spark MLlib's LSH implementation (MinHash for Jaccard distance), e.g.:

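import org.apache.spark.ml.feature.MinHashLSH

// MinHash LSH estimator; numHashTables is the parameter in question
val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")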

I understand that the higher the numHashTables, the more accurate the system, and the more complex/slow the calculation. I have two questions about the parameters:

  • What's the relationship between numHashTables and the MinHash fingerprint size?
  • How do I set the value correctly?

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/

Marsellus Wallace
  • This is an old question, but to answer the second part quickly: the value you need to set is a compromise between your computational resources and the accuracy of your model. Think of it as trading off precision against recall. – eliasah Feb 22 '18 at 14:08

1 Answer


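I think numHashTables is simply the MinHash fingerprint size. It is best treated as an empirical parameter: the right value depends on your use case. In the usual banding scheme, b * r = numHashTables, where b is the number of bands and r is the number of rows per band (currently r = 1).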
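To make the b * r = numHashTables relation more concrete, here is a minimal sketch in plain Scala (no Spark needed). It assumes the textbook banding analysis, where two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and it assumes r = 1 as stated above, so b equals numHashTables. The object name, the function name, and the sample similarity 0.5 are made up for illustration; this is not Spark's internal code.

```scala
// Sketch only: textbook LSH banding formula, not Spark's implementation.
object LshCandidateProbability {

  // Probability that two sets with Jaccard similarity s share at least one
  // identical band, given b bands of r MinHash rows each.
  def candidateProbability(s: Double, b: Int, r: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, r), b)

  def main(args: Array[String]): Unit = {
    val s = 0.5 // hypothetical Jaccard similarity of a pair we want to find
    for (numHashTables <- Seq(1, 5, 20)) {
      // Assumption: r = 1, so every hash table is its own band.
      val p = candidateProbability(s, b = numHashTables, r = 1)
      println(f"numHashTables=$numHashTables%2d -> P(candidate)=$p%.6f")
    }
  }
}
```

Under these assumptions, raising numHashTables makes it less likely that a truly similar pair is missed, at the cost of computing and comparing more hashes, which matches the precision/recall trade-off mentioned in the comment on the question.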
