
I need to run a PySpark application (v1.6.3). spark-submit has a --py-files flag to add .zip, .egg, or .py files. If I had a Python package/module at /usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy, how would I include this whole module?

Inside this directory, I do notice some *.py and *.pyc files.

  • fuzz.py
  • process.py
  • StringMatcher.py
  • string_processing.py
  • utils.py

Would I have to include each of these one by one? For example:

spark-submit \
 --py-files /usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy/fuzz.py,/usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy/process.py,/usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy/StringMatcher.py,/usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py,/usr/anaconda2/lib/python2.7/site-packages/fuzzywuzzy/utils.py

Is there an easier way?

  • should I try to find the .egg or .zip and use it (e.g. pypi)?
  • can I just zip up this directory and pass that in?
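
Concretely, for the second option I imagine something along these lines, where the archive contains the fuzzywuzzy/ directory at its root (the /tmp path and my_app.py are just placeholders, and I have not tested this):

# build a zip whose root contains the fuzzywuzzy/ package directory
cd /usr/anaconda2/lib/python2.7/site-packages
zip -r /tmp/fuzzywuzzy.zip fuzzywuzzy
# ship the zip to the executors; my_app.py stands in for the actual application
spark-submit \
 --py-files /tmp/fuzzywuzzy.zip \
 my_app.py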

Any tips or pointers would be greatly appreciated. In reality, there are more Python modules managed by conda that I need.

Jane Wayne
  • Possible duplicate of [Easiest way to install Python dependencies on Spark executor nodes?](https://stackoverflow.com/questions/29495435/easiest-way-to-install-python-dependencies-on-spark-executor-nodes) – Jane Wayne Jun 26 '17 at 16:43

1 Answer


I suggest doing it in the other direction: installing pyspark into Anaconda with:

conda install -c conda-forge pyspark=2.1.1
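
A rough sketch of how such a setup is commonly wired into spark-submit (this step is not spelled out in the answer itself; it assumes the same /usr/anaconda2 prefix exists on every node, and my_app.py is a placeholder):

# use the Anaconda interpreter (and its site-packages) on both driver and workers
export PYSPARK_PYTHON=/usr/anaconda2/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/anaconda2/bin/python
# my_app.py is a placeholder application script
spark-submit my_app.py

With Spark pointed at that interpreter, conda-managed packages such as fuzzywuzzy are importable on the workers, provided the same environment is installed on each node.
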
Piotr Kalański
  • I don't think that will work (I haven't tried it). But, just thinking about it, why would installing pyspark into a conda environment help make a third-party library available to the cluster at run-time? – Jane Wayne Jun 26 '17 at 12:48
  • After installing pyspark into the conda environment, you will be able to use Spark together with the other packages installed with Anaconda, including the standard Anaconda packages and any additionally installed ones. – Piotr Kalański Jun 26 '17 at 16:45
  • Is `pyspark=2.1.1` for Spark `1.6.3`? Or shall I use `pyspark=1.6.3`? – Jane Wayne Jun 26 '17 at 19:26
  • If you need Spark 1.6.3, you should use `pyspark=1.6.3`. – Piotr Kalański Jun 26 '17 at 22:44