I don't have nltk installed on the worker nodes, but I'm trying to ship the library through PySpark. Here is some sample code:
import pyspark
sc = pyspark.SparkContext()
sc.addPyFile("mymodule.py")
sc.addPyFile("nltk-3.1-py3-none-any.whl")
# sc.addFile("nltk-3.1-py3-none-any.whl")
# sc.addPyFile("nltk.py")
sc.addFile("stop_words.txt")
def func(raw_text):
    import os
    # install the shipped wheel on the worker before importing
    os.system("pip install nltk-3.1-py3-none-any.whl")
    import nltk
    import pandas as pd
    import mymodule
    cleaned_text = mymodule.clean_text(raw_text)
    return pd.Series({'cleaned_text': cleaned_text})

dummy_rdd = sc.parallelize(emails)
ret_list = dummy_rdd.map(func).collect()
It still throws an error from the worker node: No module named 'nltk'.
Does anyone know how to install nltk on the fly while using PySpark like this?
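For reference, here is a variant I've been considering that resolves the wheel's on-worker location with SparkFiles.get before installing, on the assumption that the bare filename isn't visible from the executor's working directory (I haven't confirmed this works):

def func(raw_text):
    import subprocess, sys
    from pyspark import SparkFiles
    import pandas as pd
    # addFile/addPyFile ships the wheel to each worker;
    # SparkFiles.get returns its absolute path on that worker
    wheel_path = SparkFiles.get("nltk-3.1-py3-none-any.whl")
    # install into the same Python interpreter the executor runs
    subprocess.check_call([sys.executable, "-m", "pip", "install", wheel_path])
    import nltk
    import mymodule
    cleaned_text = mymodule.clean_text(raw_text)
    return pd.Series({'cleaned_text': cleaned_text})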
Here is some relevant discussion I found when I googled: shipping python modules in pyspark to other nodes?
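If I understand that thread correctly, the suggestion is to zip the already-installed package directory on the driver and ship the zip with addPyFile, so workers can import it straight off sys.path without running pip at all. A minimal sketch of what I think that would look like (the paths and zip layout are my assumptions):

import os, shutil
import nltk  # installed on the driver only

# zip the installed nltk package so the archive has a top-level nltk/ dir
pkg_dir = os.path.dirname(nltk.__file__)
shutil.make_archive("nltk", "zip", os.path.dirname(pkg_dir), "nltk")

# using the same sc as above; the zip is added to sys.path on every worker
sc.addPyFile("nltk.zip")

def func(raw_text):
    import nltk  # resolved from the shipped nltk.zip on the worker
    return nltk.__version__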