I can create a Spark cluster on AWS as described here.
However, my own Python code and pip-installed libraries need to be available on both the master and the workers. This is a lot of code, and the pip installation also compiles some native libraries, so I can't simply have Spark distribute it at runtime using techniques such as registering a pip requirements file with spark_context or passing archives via the --py-files argument of spark-submit.
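To be clear, this is the kind of runtime distribution I mean (file names below are just placeholders); it works for pure-Python modules, but not for my dependencies with compiled extensions:

```bash
# Shipping code at job-submission time -- fine for pure-Python packages,
# but it does not install compiled native dependencies on the nodes.
spark-submit \
  --master yarn \
  --py-files my_code.zip \
  my_job.py
```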
Of course I could run a bash script right after aws emr create-cluster, but I wonder if there is a more automatic way, so that I can avoid maintaining a big installation script.
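Roughly, the manual approach I have in mind looks like the sketch below (key path, package name, and requirements file are placeholders, and I'm assuming the default hadoop user on EMR nodes):

```bash
#!/usr/bin/env bash
# Rough sketch of the post-create-cluster installation I'd rather not maintain.
CLUSTER_ID=$(aws emr create-cluster ... --query 'ClusterId' --output text)
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# Copy my code and requirements to every node and pip-install there.
for HOST in $(aws emr list-instances --cluster-id "$CLUSTER_ID" \
                --query 'Instances[].PrivateIpAddress' --output text); do
  scp -i my-key.pem -r my_package/ requirements.txt hadoop@"$HOST":~
  ssh -i my-key.pem hadoop@"$HOST" \
    "sudo pip install -r requirements.txt && sudo pip install ./my_package"
done
```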
So, what is the best way to set up clusters to include my code and dependencies?