10

I am able to run a Spark job using BashOperator, but I want to use SparkSubmitOperator for it with Spark standalone mode.


Here's my DAG for SparkSubmitOperator:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}
dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    conf={'master':'spark://xx.xx.xx.xx:7077'},
    dag=dag,
)

Looking at the source for spark_submit_hook, it seems _resolve_connection() always sets master=yarn. How can I change the master property to a Spark standalone master URL? Which properties can I set to run a Spark job in standalone mode?

y2k-shubham
mandar

1 Answer

16

You can either create a new connection using the Airflow Web UI or change the spark_default connection.

(Screenshot: changing the spark_default connection in the Airflow Web UI)

Master can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT, or k8s://https://<HOST>:<PORT>.
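For reference, a minimal sketch of creating such a connection programmatically instead of through the Web UI (assuming Airflow 1.x; the connection id spark_standalone and the master address are placeholders, and the spelling of the extra keys depends on the Airflow version, as noted in the comments below):

from airflow import settings
from airflow.models import Connection

# Hypothetical connection pointing at a Spark standalone master:
# the spark:// scheme goes in host, the master port in port.
session = settings.Session()
conn = Connection(
    conn_id='spark_standalone',   # any id; the operator's conn_id must match
    conn_type='spark',
    host='spark://xx.xx.xx.xx',
    port=7077,
    extra='{"queue": "root.default", "spark-binary": "spark-submit"}',
)
session.add(conn)
session.commit()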

You can also supply the following options in the Extra field of the connection:

{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}

(Screenshot: Spark connection extras in the Airflow Web UI)

Either the "spark-submit" binary should be on the PATH, or spark-home should be set in the extras of the connection.
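With the connection in place, the operator from the question would reference it via conn_id rather than passing master through conf, since the hook takes the master from the connection. A sketch under the same assumptions (spark_standalone is the hypothetical connection id from above):

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_standalone',   # resolves to master spark://xx.xx.xx.xx:7077
    application='/home/ubuntu/test.py',
    total_executor_cores=1,
    executor_cores=1,
    executor_memory='2g',
    driver_memory='1g',
    num_executors=1,
    name='airflow-spark',
    verbose=False,
    dag=dag,
)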

kaxil
  • The above solution works perfectly by changing the connection details for spark_default. Thanks – mandar May 28 '18 at 07:18
  • I am running two containers, one for Spark and the other for Airflow. How can I set the spark-submit binary? – ugur Sep 21 '18 at 04:23
  • Be careful with `{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}`: it should be `deploy-mode` instead of `deploy_mode`. Using Spark 2.4.2 the former didn't work, and it's not straightforward to debug :) – Billel Guerfa Jul 24 '19 at 13:30
  • `{"queue": "default", "deploy-mode": "cluster", "spark-home": "", "spark-binary": "spark-submit", "namespace": "default"}` from https://airflow.readthedocs.io/en/latest/howto/connection/spark.html – Ganesh Aug 28 '20 at 18:36