I can create a Spark cluster on AWS as described here.

However, my own Python code and pip libraries need to run on the master and the workers. The codebase is large, and the pip installation also compiles some native libraries, so I can't simply have Spark distribute the code at runtime using techniques such as registering a pip requirements file with spark_context or passing the --py-files argument to spark-submit.
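
For concreteness, this is the kind of runtime distribution I mean and am ruling out; the file and module names below are just placeholders:

```
# Ship code with the job at submit time: fine for small, pure-Python
# modules, but no help for pip packages that compile native extensions
# during installation.
zip -r my_code.zip my_package/
spark-submit --py-files my_code.zip my_job.py
```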

Of course I could run a bash script right after aws emr create-cluster finishes, but I wonder whether there is a more automatic way, so that I can avoid maintaining a big installation script by hand.
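
To illustrate, the manual workaround I have in mind looks roughly like this; the cluster parameters, key name, and install script are placeholders:

```
# Create the cluster...
aws emr create-cluster --name my-spark-cluster \
    --release-label emr-4.0.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles --ec2-attributes KeyName=my-key

# ...then, once it is running, copy and run an install script on the
# master and on every worker, e.g. for each node address:
scp -i my-key.pem install_deps.sh hadoop@<node-address>:
ssh -i my-key.pem hadoop@<node-address> 'bash install_deps.sh'
```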

So, what is the best way to set up clusters to include my code and dependencies?

Joshua Fox
  • You could write/run a JSON file to execute the bash script (to make it look automatic); I had a similar problem and have not found another way yet (see the sketch below these comments). – GameOfThrows Sep 08 '15 at 10:29
  • I believe this question is covered in http://stackoverflow.com/questions/24686474/shipping-python-modules-in-pyspark-to-other-nodes – Erik Schmiegelow Sep 08 '15 at 15:04
  • Possible duplicate of [shipping python modules in pyspark to other nodes?](http://stackoverflow.com/questions/24686474/shipping-python-modules-in-pyspark-to-other-nodes) – Paul Sweatte Apr 28 '17 at 14:11
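
If I read GameOfThrows' suggestion above correctly, it means feeding the bash script to aws emr create-cluster through a JSON definition, roughly like this sketch; the bucket, script, and file names are all made up:

```
# Hypothetical bootstrap.json: tells EMR to run an install script from S3
# on every node as it starts up.
cat > bootstrap.json <<'EOF'
[
  {
    "Name": "Install my code and pip dependencies",
    "Path": "s3://my-bucket/install_deps.sh"
  }
]
EOF

aws emr create-cluster --name my-spark-cluster \
    --release-label emr-4.0.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles --ec2-attributes KeyName=my-key \
    --bootstrap-actions file://bootstrap.json
```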

0 Answers