LSH and Sentence Similarity

Question

Is it possible to use Spark's LSH implementation to find sentence similarity? I have approximately 16k rows in my dataset, which means approximately 16k*16k = 256,000,000 row pairs for which a similarity distance has to be computed, and this number is going to increase every day. I first use the nltk, pymorphy2, and gensim libraries for preprocessing, then compute TF-IDF, and finally feed the sparse idf vectors into the LSH algorithm.
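For reference, the pair arithmetic can be checked with a few lines of plain Python. This is only a sketch: `pair_counts` is a hypothetical helper name, and it assumes a self-join where (A, B) and (B, A) describe the same pair and a row is not compared with itself.

```python
def pair_counts(n_rows):
    """Return (ordered, distinct) pair counts for an n_rows x n_rows self-join."""
    # Every row against every row, including itself and both orderings.
    ordered = n_rows * n_rows
    # Each unordered pair counted once, self-pairs excluded.
    distinct = n_rows * (n_rows - 1) // 2
    return ordered, distinct


ordered, distinct = pair_counts(16_000)
print(f"{ordered:,} ordered pairs, {distinct:,} distinct pairs")
```

Either way the count grows quadratically with the number of rows, which is why an approximate join is attractive here.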

That's the structure of my data

When I use my code,

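from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

def LSH(Pred_Factors):
    # Euclidean-distance LSH over the sparse "idf" vectors
    brp = BucketedRandomProjectionLSH(inputCol="idf", outputCol="hashes",
                                      bucketLength=1.0, numHashTables=10)

    model = brp.fit(Pred_Factors)

    # Add the "hashes" column up front
    Hashed_Factors = model.transform(Pred_Factors)

    # Self-join: keep pairs whose Euclidean distance is below the threshold
    sim_table = model.approxSimilarityJoin(Hashed_Factors, Hashed_Factors,  # hashes would be computed anyway
                                    threshold=1.2, distCol="EuclideanDistance") \
         .select(col("datasetA").alias("idA"),
              col("datasetB").alias("idB"),
              col("EuclideanDistance")).cache()

    return sim_table

sim_table = LSH(tfidf)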

The similarity cannot be computed because the data is so large (the sparse vectors are too big for LSH to handle; sometimes it does finish, but it takes 20 minutes at 95-100% CPU and 3 GB of memory). I even changed the number of partitions from 200 to 1000, and it doesn't help significantly. Fortunately, I found that LSH can work with sparse data without converting it. The only way I know of that works is to compute the sum of idf:

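from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

## UDF SUM: add up the non-zero entries of each sparse vector
sum_ = udf(lambda v: float(v.values.sum()), DoubleType())
idf_sum = tfidf.withColumn('idf_sum', sum_('idf'))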

and then I can apply the sum of idf to LSH and everything goes well. Can someone suggest a better way to do this, or confirm that this is a normal way to compute Euclidean similarity between a huge number of text rows using LSH? I understand that cosine similarity is better for this purpose, but Spark's LSH implementation only supports Jaccard (MinHash) and Euclidean distances. Could MinHash somehow ease the computational pressure?
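On the cosine point: for L2-normalized vectors, squared Euclidean distance equals 2 - 2*cos, so a Euclidean LSH over unit-length vectors ranks pairs the same way cosine similarity would. Below is a minimal plain-Python sketch of that identity (no Spark involved). The helper names and the two sample vectors are made up for illustration. In Spark the same idea would presumably mean L2-normalizing the "idf" column (for example with pyspark.ml.feature.Normalizer, p=2.0) before BucketedRandomProjectionLSH, which is an assumption about the pipeline, not something tested on this data.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_threshold(min_cosine):
    """Distance threshold on unit vectors equivalent to cosine >= min_cosine."""
    return math.sqrt(2.0 - 2.0 * min_cosine)


# Two made-up TF-IDF-like vectors (illustrative values only).
a = [0.0, 1.2, 0.0, 3.4]
b = [0.5, 1.0, 0.0, 3.0]

cos = cosine_similarity(a, b)
dist = euclidean_distance(l2_normalize(a), l2_normalize(b))

# On unit vectors the distance is fully determined by the cosine similarity.
print(math.isclose(dist, euclidean_threshold(cos)))
```

Under this identity, a distance threshold of 1.2 on unit vectors corresponds to a cosine similarity of at least 0.28, since 2 - 2*0.28 = 1.44 = 1.2^2.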

P.S. I want to stay within Spark. Thanks for any help, suggestions, or advice :)

Check https://stackoverflow.com/questions/43938672/efficient-string-matching-in-apache-spark — fjsj, May 23 '19 at 01:39
