
We are building an Airflow server on an EC2 instance that communicates with an EMR cluster to run Spark jobs. We are trying to submit a BashOperator DAG that runs a spark-submit command for a simple wordcount application. Here is our spark-submit command:

./spark-submit --deploy-mode client --verbose --master yarn wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/ ;

We're getting the following error: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

So far we've set HADOOP_CONF_DIR and YARN_CONF_DIR to /etc/hadoop/ in the .bashrc on our EC2 instance, and we've copied spark-env.sh from the EMR cluster to /etc/hadoop/ on the EC2 instance.
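For context, the DAG task wrapping that command looks roughly like the sketch below (the dag_id, schedule, and the path to spark-submit are placeholders; the exports mirror what we put in .bashrc, repeated inline since the task's shell may not source it):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Minimal sketch; dag_id, schedule and the spark-submit path are illustrative.
dag = DAG(
    dag_id="emr_wordcount",
    start_date=datetime(2019, 7, 1),
    schedule_interval=None,
)

wordcount = BashOperator(
    task_id="spark_submit_wordcount",
    # Same command we run by hand, with the conf dirs exported inline because
    # the non-interactive shell used by the task may not read .bashrc.
    bash_command=(
        "export HADOOP_CONF_DIR=/etc/hadoop/ YARN_CONF_DIR=/etc/hadoop/ && "
        "/path/to/spark/bin/spark-submit --deploy-mode client --verbose "
        "--master yarn wordcount.py "
        "s3://bucket/inputwordcount.txt s3://bucket/outputbucket/"
    ),
    dag=dag,
)
```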

We aren't too sure which files we are supposed to copy into the HADOOP_CONF_DIR/YARN_CONF_DIR directory on the EC2 instance for the spark-submit command to send the job to the EMR cluster running Spark. Has anyone had experience configuring a server to send spark-submit commands to a remote cluster? We would appreciate the help!

  • Not answering your question, but there are alternative ways of doing spark-submit to remote EMR via Airflow. See [this](https://stackoverflow.com/a/54092691/3679900) – y2k-shubham Jul 19 '19 at 08:02
  • I actually saw this post earlier yesterday before asking this question, and it was very useful. We are trying to design our cluster to handle all of these methods, so our question is strictly about submitting Spark jobs using the spark-submit command. – AsapYAMLGang Jul 19 '19 at 13:36

1 Answer


I think the issue is that you are running spark-submit on the EC2 machine. I would suggest creating the EMR cluster with a corresponding step. Here is an example from the Airflow repo itself. Or, if you prefer using BashOperator, you should use the AWS CLI, namely the aws emr command.
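In case it helps, here is a rough sketch of the add-step route using the contrib operators that ship with Airflow 1.10, assuming the cluster is already running and the wordcount script has been uploaded to S3 (the cluster id, dag_id, and S3 paths are placeholders based on the question):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

# Spark step in the EMR API format. command-runner.jar runs spark-submit on the
# cluster itself, so no Hadoop/YARN configuration is needed on the EC2 machine.
SPARK_STEPS = [
    {
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://bucket/wordcount.py",   # placeholder: script uploaded to S3
                "s3://bucket/inputwordcount.txt",
                "s3://bucket/outputbucket/",
            ],
        },
    }
]

dag = DAG(
    dag_id="emr_add_step_wordcount",
    start_date=datetime(2019, 7, 1),
    schedule_interval=None,
)

add_step = EmrAddStepsOperator(
    task_id="add_step",
    job_flow_id="j-XXXXXXXXXXXXX",        # placeholder: id of the running cluster
    aws_conn_id="aws_default",
    steps=SPARK_STEPS,
    dag=dag,
)

# Wait for the submitted step to complete before the DAG run finishes.
watch_step = EmrStepSensor(
    task_id="watch_step",
    job_flow_id="j-XXXXXXXXXXXXX",
    step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    aws_conn_id="aws_default",
    dag=dag,
)

add_step >> watch_step
```

If you would rather stick with BashOperator, `aws emr add-steps --cluster-id <id> --steps ...` submits the same kind of step from the CLI.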

gorros
  • So we are successfully able to run step functions and the aws cli using the BashOperator, but in addition we would like to get the remote spark-submit command working as well. We're trying to design our cluster to handle all submit methods. – AsapYAMLGang Jul 19 '19 at 13:27