
I am using approxSimilarityJoin to find Jaccard similarity between two sets.

    import org.apache.spark.ml.feature.{HashingTF, MinHashLSH}

    // "values" holds the set of strings (tokens) for each record
    val dfA = hashtoseq.toDF("id", "values")

    // Hash each token set into a sparse binary feature vector
    val hashingTF = new HashingTF()
      .setInputCol("values")
      .setOutputCol("features")
      .setNumFeatures(1048576)
    val featurizedData = hashingTF.transform(dfA)

    val mh = new MinHashLSH()
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(featurizedData)

    // Self-join: keep pairs with Jaccard distance <= 0.45
    val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData, 0.45)
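
For context, the join output has nested datasetA/datasetB struct columns plus a distCol distance column (the standard approxSimilarityJoin schema), and I flatten it roughly like this (assuming the id values are orderable, so self-matches and mirrored pairs can be dropped):

    import org.apache.spark.sql.functions.col

    // Flatten the join result; idA < idB drops self-pairs and mirrored duplicates
    val pairs = dffilter
      .filter(col("datasetA.id") < col("datasetB.id"))
      .select(
        col("datasetA.id").as("idA"),
        col("datasetB.id").as("idB"),
        col("distCol").as("jaccardDistance"))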

I am getting a shuffle write of around 270 GB for a 16 GB dataset, and the job takes more than 3 hours even on a cluster of 3 worker nodes, each with 64 GB RAM and 64 cores.

I went through the following related question: "LSH Spark stucks forever at approxSimilarityJoin() function", but the suggestions there did not work for me.

I have also gone through the Databricks blog post where they compare runtime against data size. For data in the MB range (436 MB), approxSimilarityJoin takes 25 minutes; for datasets in the GB range, it becomes a problem: https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html

Can we reduce this shuffle write by changing the code or the server configuration, or is there a problem with the approxSimilarityJoin function itself? Is there a more efficient way to compute Jaccard similarity on large datasets?
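
To make the question concrete, these are the kinds of code/configuration changes I mean (the specific values below are placeholders, not settings I have verified):

    // Spread the exploded LSH join across more shuffle tasks
    // (2000 is an arbitrary starting point, not a recommendation)
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    // Cache the featurized input so both sides of the self-join reuse it
    featurizedData.cache()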

Rajjat Dadwal
  • Did you find any solution? I am facing the same kind of issue. – dks551 Jan 11 '19 at 15:42
  • I have posted a related solution in [this post](https://stackoverflow.com/questions/49185464/jaccard-similarity-of-an-rdd-with-the-help-of-spark-and-scala-without-cartesian). – Rajjat Dadwal Jan 21 '19 at 22:49

0 Answers