I can create a Spark cluster on AWS as described here.
However, my own Python code and pip-installed libraries need to be available on both the master and the workers. This is a lot of code, and the pip installation also compiles some native libraries, so I can't simply have Spark distribute it at runtime using techniques such as registering a pip requirements file with spark_context or passing archives via the --py-files argument of spark-submit.
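To be clear, this is the kind of runtime distribution I mean (file names below are just placeholders); it works for pure-Python modules, but not for my dependencies with compiled extensions:

```bash
# Shipping code at job-submission time -- fine for pure-Python packages,
# but it does not install compiled native dependencies on the nodes.
spark-submit \
  --master yarn \
  --py-files my_code.zip \
  my_job.py
```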
Of course I could run a bash script right after aws emr create-cluster, but I wonder if there is a more automatic way, so that I can avoid maintaining a big installation script.
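Roughly, the manual approach I have in mind looks like the sketch below (key path, package name, and requirements file are placeholders, and I'm assuming the default hadoop user on EMR nodes):

```bash
#!/usr/bin/env bash
# Rough sketch of the post-create-cluster installation I'd rather not maintain.
CLUSTER_ID=$(aws emr create-cluster ... --query 'ClusterId' --output text)
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# Copy my code and requirements to every node and pip-install there.
for HOST in $(aws emr list-instances --cluster-id "$CLUSTER_ID" \
                --query 'Instances[].PrivateIpAddress' --output text); do
  scp -i my-key.pem -r my_package/ requirements.txt hadoop@"$HOST":~
  ssh -i my-key.pem hadoop@"$HOST" \
    "sudo pip install -r requirements.txt && sudo pip install ./my_package"
done
```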
So, what is the best way to set up clusters to include my code and dependencies?