I'm trying to use approxSimilarityJoin from Spark MLlib's LSH implementation,
specifically MinHash for Jaccard distance, e.g.:
    import org.apache.spark.ml.feature.MinHashLSH

    val mh = new MinHashLSH()
      .setNumHashTables(5)       // the parameter I'm asking about
      .setInputCol("features")
      .setOutputCol("hashes")
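For context, here is a minimal sketch of how the fitted model might then be used in the join. The toy DataFrames dfA and dfB, the local SparkSession, the fixed seed, the 0.6 threshold, and the helper name similarPairs are all illustrative assumptions, not part of my real pipeline.

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MinHashJoinSketch {

  // Hypothetical helper: fits MinHashLSH on toy data and returns the pairs
  // whose exact Jaccard distance is below `threshold`.
  def similarPairs(spark: SparkSession, numHashTables: Int, threshold: Double): DataFrame = {
    // Each set is encoded as a binary sparse vector: a non-zero index means
    // "this element is in the set". MinHashLSH needs at least one non-zero entry.
    val dfA = spark.createDataFrame(Seq(
      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val dfB = spark.createDataFrame(Seq(
      (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
      (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
      (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val mh = new MinHashLSH()
      .setNumHashTables(numHashTables)
      .setSeed(42L) // assumed seed, only to make runs repeatable
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(dfA)

    // Candidate pairs come from hash collisions; the join then keeps only
    // pairs whose computed Jaccard distance is below the threshold.
    model.approxSimilarityJoin(dfA, dfB, threshold, "JaccardDistance")
      .select(
        col("datasetA.id").alias("idA"),
        col("datasetB.id").alias("idB"),
        col("JaccardDistance"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MinHashJoinSketch")
      .master("local[*]")
      .getOrCreate()

    similarPairs(spark, numHashTables = 5, threshold = 0.6).show()

    spark.stop()
  }
}
```

Because the join is approximate, which pairs come back can change with numHashTables and the seed, which is exactly the behaviour I'd like to understand.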
I understand that a higher numHashTables makes the results more accurate, but also makes the computation slower and more expensive. I have two questions about the parameters:
- What's the relationship between numHashTables and the MinHash fingerprint size?
- How do I choose a suitable value for numHashTables?
NOTE: I believe the algorithm was contributed to MLlib by Uber: https://eng.uber.com/lsh/