I am trying to implement the TextRank algorithm, where I compute a cosine-similarity matrix over all the sentences. I want to parallelize the creation of the similarity matrix using Spark, but I don't know how to implement it. Here is the code:
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

cluster_summary_dict = {}
for cluster, sentences in tqdm(cluster_wise_sen.items()):
    # Build the pairwise cosine-similarity matrix for this cluster's sentences
    sen_sim_matrix = np.zeros([len(sentences), len(sentences)])
    for row in range(len(sentences)):
        for col in range(len(sentences)):
            if row != col:
                sen_sim_matrix[row, col] = cosine_similarity(
                    cluster_dict[cluster][row].reshape(1, 100),
                    cluster_dict[cluster][col].reshape(1, 100))[0, 0]
    # Rank sentences with PageRank on the similarity graph
    sentence_graph = nx.from_numpy_array(sen_sim_matrix)
    scores = nx.pagerank(sentence_graph)
    pagerank_sentences = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                                reverse=True)
    cluster_summary_dict[cluster] = pagerank_sentences
Here, cluster_wise_sen is a dictionary mapping each cluster to its list of sentences ({'cluster 1' : [list of sentences] ,...., 'cluster n' : [list of sentences]}), and cluster_dict contains the 100-d vector representations of the sentences. I have to compute the sentence similarity matrix for each cluster. Since this is time-consuming, I am looking to parallelize it using Spark.
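For reference, this is roughly the shape of solution I have in mind: distribute whole clusters across workers as an RDD, broadcast the sentence vectors, and let each task run PageRank on its own cluster. It is only a rough sketch, assuming a local SparkSession, that networkx and scikit-learn are available on the workers, and that cluster_dict maps each cluster to a stackable list of 100-d vectors (bc_vectors and rank_cluster are names I made up). It also calls cosine_similarity once on the whole stacked array instead of once per pair, which should already remove most of the Python-loop overhead. Is this the right way to do it in Spark?

import numpy as np
import networkx as nx
from pyspark.sql import SparkSession
from sklearn.metrics.pairwise import cosine_similarity

spark = SparkSession.builder.appName("textrank").getOrCreate()
sc = spark.sparkContext

# Ship the sentence vectors to every worker once
bc_vectors = sc.broadcast(cluster_dict)

def rank_cluster(item):
    cluster, sentences = item
    vectors = np.vstack(bc_vectors.value[cluster])  # shape: (n_sentences, 100)
    sim = cosine_similarity(vectors)                # full pairwise matrix in one call
    np.fill_diagonal(sim, 0.0)                      # zero self-similarity, as in the loop version
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                    reverse=True)
    return cluster, ranked

# One task per cluster; collect the ranked sentences back to the driver
cluster_summary_dict = dict(
    sc.parallelize(list(cluster_wise_sen.items()))
      .map(rank_cluster)
      .collect()
)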