I am trying to implement the TextRank algorithm, where I compute a cosine-similarity matrix over all the sentences. I want to parallelize the creation of the similarity matrix using Spark, but I don't know how to implement it. Here is the code:
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

cluster_summary_dict = {}
for cluster, sentences in tqdm(cluster_wise_sen.items()):
    # Build the pairwise cosine-similarity matrix for this cluster's sentences
    sen_sim_matrix = np.zeros([len(sentences), len(sentences)])
    for row in range(len(sentences)):
        for col in range(len(sentences)):
            if row != col:
                sen_sim_matrix[row, col] = cosine_similarity(
                    cluster_dict[cluster][row].reshape(1, 100),
                    cluster_dict[cluster][col].reshape(1, 100))[0, 0]
    # Rank sentences with PageRank on the similarity graph
    sentence_graph = nx.from_numpy_array(sen_sim_matrix)
    scores = nx.pagerank(sentence_graph)
    pagerank_sentences = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                                reverse=True)
    cluster_summary_dict[cluster] = pagerank_sentences
Here, cluster_wise_sen is a dictionary mapping each cluster to its list of sentences ({'cluster 1' : [list of sentences] ,...., 'cluster n' : [list of sentences]}), and cluster_dict contains the 100-d vector representations of the sentences. I have to compute the sentence similarity matrix for each cluster. Since this is time-consuming, I am looking to parallelize it using Spark.
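For reference, this is roughly the shape of solution I have in mind: distribute whole clusters across workers as an RDD, broadcast the sentence vectors, and let each task run PageRank on its own cluster. It is only a rough sketch, assuming a local SparkSession, that networkx and scikit-learn are available on the workers, and that cluster_dict maps each cluster to a stackable list of 100-d vectors (bc_vectors and rank_cluster are names I made up). It also calls cosine_similarity once on the whole stacked array instead of once per pair, which should already remove most of the Python-loop overhead. Is this the right way to do it in Spark?

import numpy as np
import networkx as nx
from pyspark.sql import SparkSession
from sklearn.metrics.pairwise import cosine_similarity

spark = SparkSession.builder.appName("textrank").getOrCreate()
sc = spark.sparkContext

# Ship the sentence vectors to every worker once
bc_vectors = sc.broadcast(cluster_dict)

def rank_cluster(item):
    cluster, sentences = item
    vectors = np.vstack(bc_vectors.value[cluster])  # shape: (n_sentences, 100)
    sim = cosine_similarity(vectors)                # full pairwise matrix in one call
    np.fill_diagonal(sim, 0.0)                      # zero self-similarity, as in the loop version
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                    reverse=True)
    return cluster, ranked

# One task per cluster; collect the ranked sentences back to the driver
cluster_summary_dict = dict(
    sc.parallelize(list(cluster_wise_sen.items()))
      .map(rank_cluster)
      .collect()
)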