
I have a requirement where I need to submit a Spark job using Airflow. The Airflow and Hadoop clusters are on different servers.

Currently, the simple solution is to use a BashOperator to SSH into a Hadoop cluster machine and submit the job there.
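
A rough sketch of that approach, with a hypothetical edge-node host and application path:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: the Airflow worker shells out over SSH to a Hadoop
# edge node and runs spark-submit there.
with DAG(
    dag_id="spark_submit_over_ssh",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = BashOperator(
        task_id="submit_spark_job",
        bash_command=(
            "ssh user@hadoop-edge-node "
            "'spark-submit --master yarn --deploy-mode cluster /path/to/app.py'"
        ),
    )
```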

But I also want to explore the SparkSubmitOperator.

I have gone through many articles and Stack Overflow questions, but did not find any detailed explanation of how to set up the Airflow server for this to work. I found the Stack Overflow question below, which mentions that we need the Spark binaries and a configured yarn-site.xml on the Airflow machine.

Is there a way to submit spark job on different server running master

But nowhere did I find how to set these things up.


1 Answer


As written in the other post

You really only need to configure a yarn-site.xml file

This defines the YARN ResourceManager address(es)
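
A minimal sketch of yarn-site.xml, with a placeholder ResourceManager hostname:

```xml
<configuration>
  <!-- Placeholder hostname: point this at your actual YARN ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
</configuration>
```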

Similarly, core-site.xml defines the Namenode address(es) / nameservice for HDFS.
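
Likewise, a minimal core-site.xml sketch, with a placeholder NameNode address:

```xml
<configuration>
  <!-- Placeholder address: point this at your actual NameNode or HDFS nameservice -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```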

You'll need to place both files on any node that will run Spark code, and point the SPARK_CONF_DIR environment variable at the directory containing them. And, as documented for the SparkSubmitOperator, you'll need to install Spark on each Airflow worker and ensure spark-submit is on the OS PATH; in other words, it's exactly the same as the BashOperator, but it provides a way to explicitly call and configure the Spark executable. You don't need SSH to run Spark.
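
For example, once Spark is installed on the worker, SPARK_CONF_DIR points at the directory holding those XML files, and the apache-airflow-providers-apache-spark provider is installed, a task could look roughly like this (DAG ID, connection ID, and application path are placeholders; the Spark connection's host would be set to yarn):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG: spark-submit runs locally on the Airflow worker and
# finds the YARN ResourceManager via the yarn-site.xml in SPARK_CONF_DIR.
with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        conn_id="spark_default",        # Airflow connection of type Spark, host set to "yarn"
        application="/path/to/app.py",  # placeholder application path
        name="airflow-spark-job",
        verbose=True,
    )
```

With that in place, spark-submit talks to YARN directly from the Airflow worker, so no SSH hop is involved.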


Other than that, you may need to configure OS firewall rules or network routes, but that's not specific to Hadoop or Airflow.
