I've been experimenting with Word2Vec and PySpark recently.
Although I can train a very large model in about 20 minutes, using almost all 64 cores of a cloud instance, collecting the synonyms for every word in the vocabulary takes almost 3 hours.
When I try to map lambda x: model.findSynonyms(x)
within the sentences RDD, I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
In this exact use case, I have to find the 100 most similar words for every word in the vocabulary. Is there any way I can do this faster than looking up synonyms one word at a time?
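One direction I'm considering, sketched below: since the model's word vectors can be pulled to the driver (e.g. via model.getVectors()), the all-pairs top-100 could be computed locally with a single normalized matrix product instead of per-word findSynonyms calls. This is just an assumption-laden sketch (the top_k_synonyms helper is mine, and it assumes the whole vocabulary fits in driver memory):

```python
import numpy as np

def top_k_synonyms(vectors, k=100):
    """Brute-force top-k cosine neighbors for every word at once.

    vectors: dict mapping word -> vector (e.g. model.getVectors()
    collected to the driver). Returns {word: [(neighbor, cosine), ...]}.
    """
    words = list(vectors)
    mat = np.array([vectors[w] for w in words], dtype=np.float64)
    # L2-normalize rows so a plain dot product equals cosine similarity.
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    sims = mat @ mat.T                    # all-pairs similarity matrix
    np.fill_diagonal(sims, -np.inf)       # a word is not its own synonym
    k = min(k, len(words) - 1)
    result = {}
    for i, w in enumerate(words):
        idx = np.argpartition(sims[i], -k)[-k:]    # unordered top-k
        idx = idx[np.argsort(sims[i][idx])[::-1]]  # sort descending
        result[w] = [(words[j], float(sims[i][j])) for j in idx]
    return result
```

For a vocabulary of V words and dimension d this is O(V^2 * d) memory/compute in one BLAS-backed multiply, which should beat V sequential findSynonyms round-trips by a wide margin, though for very large vocabularies the similarity matrix would need to be computed in row blocks.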