What value to use for numHashTables in Spark LSH by Uber?

Question

I'm trying to use .approxSimilarityJoin from Spark MLlib's LSH implementation (MinHash for Jaccard distance), e.g.:

import org.apache.spark.ml.feature.MinHashLSH

// MinHash LSH estimator with 5 hash tables
val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")
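
For context, here is a rough sketch of how such an estimator might feed into .approxSimilarityJoin. The toy vectors, the 0.6 distance threshold, the local SparkSession, and the names MinHashJoinSketch and buildEstimator are all assumptions made for illustration; they are not part of the original question.

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object MinHashJoinSketch {
  // Hypothetical helper: builds the estimator from the question
  // with a configurable number of hash tables.
  def buildEstimator(numHashTables: Int): MinHashLSH =
    new MinHashLSH()
      .setNumHashTables(numHashTables)
      .setInputCol("features")
      .setOutputCol("hashes")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MinHashJoinSketch")
      .getOrCreate()

    // Toy binary sets encoded as sparse vectors (made-up data).
    val dfA = spark.createDataFrame(Seq(
      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val dfB = spark.createDataFrame(Seq(
      (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
      (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
      (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val model = buildEstimator(5).fit(dfA)

    // Keep candidate pairs whose Jaccard distance is below 0.6.
    model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
      .select(
        col("datasetA.id").alias("idA"),
        col("datasetB.id").alias("idB"),
        col("JaccardDistance"))
      .show()

    spark.stop()
  }
}
```

Rerunning the join with different values passed to buildEstimator is one way to see how many true pairs are missed at a given setting.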

I understand that the higher the numHashTables, the more accurate the system, and the more complex/slow the calculation. I have two questions about the parameters:

What's the relationship between numHashTables and the MinHash fingerprint size?
How do I set the value correctly?

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/

This is an old question, but to answer the second part quickly: the value you need is a compromise between your computational resources and the accuracy of your model. Think of it as a trade-off similar to precision versus recall. – eliasah, Feb 22 '18 at 14:08

score 0 · Answer 1 · answered Apr 12 '22 at 06:52


I think numHashTables is simply the MinHash fingerprint size. It is an empirical parameter whose best value depends on your use case. In the usual banding notation, b * r = numHashTables, and r = 1 in the current implementation.
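
To make the b * r relation concrete, take the textbook MinHash assumption that a single hash function collides with probability equal to the Jaccard similarity s. With r = 1, a pair becomes a join candidate when at least one of the numHashTables hash values matches, which happens with probability 1 - (1 - s)^numHashTables. The sketch below tabulates that value. The names CandidateProbability and candidateProbability are hypothetical, and Spark's actual hash family only approximates this idealized model.

```scala
object CandidateProbability {
  // Probability that a pair with Jaccard similarity s shares at least one
  // hash value across numHashTables independent MinHash functions
  // (OR-amplification with one hash function per table, i.e. r = 1).
  def candidateProbability(s: Double, numHashTables: Int): Double =
    1.0 - math.pow(1.0 - s, numHashTables)

  def main(args: Array[String]): Unit = {
    // Tabulate the probability for a few table counts and similarities.
    for (n <- Seq(1, 5, 10, 20); s <- Seq(0.2, 0.5, 0.8)) {
      val p = candidateProbability(s, n)
      println(f"numHashTables=$n%2d  similarity=$s%.1f  P(candidate)=$p%.3f")
    }
  }
}
```

Under this model, more hash tables mean fewer similar pairs are missed, but more dissimilar pairs also become candidates and must have their exact distance computed, which is the accuracy versus cost trade-off mentioned in the comment above.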


min fan


