
I don't have nltk installed on the worker nodes, so I am trying to ship the library through PySpark. Here is sample code:

    import pyspark

    sc = pyspark.SparkContext()
    sc.addPyFile("mymodule.py")
    sc.addPyFile("nltk-3.1-py3-none-any.whl")
    # sc.addFile("nltk-3.1-py3-none-any.whl")
    # sc.addPyFile("nltk.py")
    sc.addFile("stop_words.txt")

    def func(raw_text):
        import os
        import pandas as pd  # needed on the worker for the return value
        # attempt to install the shipped wheel on the worker before importing
        os.system("pip install nltk-3.1-py3-none-any.whl")
        import nltk
        import mymodule
        cleaned_text = mymodule.clean_text(raw_text)
        return pd.Series({'cleaned_text': cleaned_text})

    dummy_rdd = sc.parallelize(emails)
    ret_list = dummy_rdd.map(func).collect()

It still throws an error from the worker node: No module named 'nltk'.

Does anyone know how to install nltk on the fly while using PySpark like this?
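One workaround I am considering (an untested sketch: it assumes the wheel is shipped with sc.addFile, that pip is available under each worker's Python, and that nltk's own dependencies are satisfied) is to resolve the wheel's local path on the worker with SparkFiles.get, then pip-install it into a temporary directory that gets prepended to sys.path:

    import sys
    import tempfile
    import subprocess
    from pyspark import SparkFiles

    sc.addFile("nltk-3.1-py3-none-any.whl")  # ship the wheel as a plain file

    def ensure_nltk():
        # SparkFiles.get() returns the local path of a file shipped via sc.addFile()
        wheel_path = SparkFiles.get("nltk-3.1-py3-none-any.whl")
        install_dir = tempfile.mkdtemp()  # throwaway install location on this worker
        # Install with the worker's own interpreter; assumes pip is available there
        subprocess.check_call([sys.executable, "-m", "pip", "install",
                               "--target", install_dir, wheel_path])
        sys.path.insert(0, install_dir)  # make the fresh install importable

    def func(raw_text):
        try:
            import nltk  # already installed by an earlier task on this worker
        except ImportError:
            ensure_nltk()
            import nltk
        # ... rest of the cleaning logic as above ...

I haven't verified this end to end, so treat it as a sketch rather than a confirmed fix.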

Here is some relevant discussion I found when googling: shipping python modules in pyspark to other nodes?
