I've been experimenting with Word2Vec and PySpark recently.
Although I can train a very large model in about 20 minutes, using almost all 64 cores of a cloud instance, collecting the synonyms for every word in the vocabulary takes almost 3 hours.
When I try to map lambda x: model.findSynonyms(x)
within the sentences RDD, I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
In this exact use case, I have to find the 100 most similar words for every word in the vocabulary. Is there any way I can do this faster than looking up synonyms one word at a time?
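One direction I'm considering, sketched below: since the model's word vectors can be pulled to the driver (e.g. via model.getVectors()), the all-pairs top-100 could be computed locally with a single normalized matrix product instead of per-word findSynonyms calls. This is just an assumption-laden sketch (the top_k_synonyms helper is mine, and it assumes the whole vocabulary fits in driver memory):

```python
import numpy as np

def top_k_synonyms(vectors, k=100):
    """Brute-force top-k cosine neighbors for every word at once.

    vectors: dict mapping word -> vector (e.g. model.getVectors()
    collected to the driver). Returns {word: [(neighbor, cosine), ...]}.
    """
    words = list(vectors)
    mat = np.array([vectors[w] for w in words], dtype=np.float64)
    # L2-normalize rows so a plain dot product equals cosine similarity.
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    sims = mat @ mat.T                    # all-pairs similarity matrix
    np.fill_diagonal(sims, -np.inf)       # a word is not its own synonym
    k = min(k, len(words) - 1)
    result = {}
    for i, w in enumerate(words):
        idx = np.argpartition(sims[i], -k)[-k:]    # unordered top-k
        idx = idx[np.argsort(sims[i][idx])[::-1]]  # sort descending
        result[w] = [(words[j], float(sims[i][j])) for j in idx]
    return result
```

For a vocabulary of V words and dimension d this is O(V^2 * d) memory/compute in one BLAS-backed multiply, which should beat V sequential findSynonyms round-trips by a wide margin, though for very large vocabularies the similarity matrix would need to be computed in row blocks.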