
We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to use it to create the different workflows. I searched the web but did not find a step-by-step guide for scheduling a Spark job with Airflow, or an option to run the jobs against a Spark master on a different server.

Answer to this will be highly appreciated. Thanks in advance.

Raghav salotra

1 Answer


There are three ways you can submit Spark jobs remotely using Apache Airflow:

(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command's stdout and stderr.

You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.

Once an ApplicationMaster is deployed within YARN, Spark is running local to the Hadoop cluster.

If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath.
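For reference, here is a minimal sketch of a DAG using SparkSubmitOperator, assuming Airflow 2.x with the apache-airflow-providers-apache-spark package installed; the connection id, DAG id and application path below are placeholders, not something from the original answer:

```python
# Minimal sketch: submit a Spark job via SparkSubmitOperator.
# Assumes a "spark_default" connection pointing at your YARN master, and a
# spark-submit binary plus yarn-site.xml available on the Airflow worker.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        conn_id="spark_default",                 # connection holding master / deploy-mode
        application="/path/to/my_spark_app.py",  # placeholder application path
        name="spark_submit_example",
        verbose=True,                            # stream spark-submit stdout/stderr into the task log
    )
```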

(2) Using SSHOperator: Use this operator to run bash commands, such as spark-submit, on a remote server (over the SSH protocol via the paramiko library). The benefit of this approach is that you don't need to copy hdfs-site.xml or maintain any files on the Airflow server.
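A minimal sketch with SSHOperator might look like the following, assuming an Airflow SSH connection (here called spark_edge_node, a placeholder name) to a host that has spark-submit and the YARN client configs installed; paths and ids are assumptions:

```python
# Minimal sketch: run spark-submit on a remote host over SSH.
# Assumes an Airflow SSH connection "spark_edge_node" (placeholder) to a machine
# that has spark-submit and the YARN client configuration available.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="spark_submit_over_ssh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SSHOperator(
        task_id="submit_job",
        ssh_conn_id="spark_edge_node",  # placeholder SSH connection id
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/path/to/my_spark_app.py"  # placeholder path on the remote host
        ),
    )
```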

(3) Using SimpleHttpOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
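As a rough sketch, submitting a batch to Livy's /batches endpoint with SimpleHttpOperator could look like this, assuming an Airflow HTTP connection (here called livy_http, a placeholder) that points at the Livy server on its default port 8998; the file path and names are assumptions:

```python
# Minimal sketch: submit a Spark batch through Livy's REST API.
# Assumes an Airflow HTTP connection "livy_http" pointing at http://<livy-host>:8998.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="spark_submit_via_livy",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_batch = SimpleHttpOperator(
        task_id="submit_batch",
        http_conn_id="livy_http",  # placeholder HTTP connection id
        endpoint="batches",        # Livy batch submission endpoint
        method="POST",
        data=json.dumps({
            "file": "/path/to/my_spark_app.py",  # placeholder; must be readable by Livy
            "name": "airflow_livy_example",
        }),
        headers={"Content-Type": "application/json"},
        log_response=True,
    )
```

Note that Livy only accepts the submission and returns immediately; to track completion you would typically poll GET /batches/{batch_id} afterwards, or use the LivyOperator from the apache-airflow-providers-apache-livy package, which handles that polling for you.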

I personally prefer SSHOperator :)

kaxil
  • But in the case of SSHOperator, we need to run Airflow on the same server as the Spark master node, don't we? – Raghav salotra Nov 20 '18 at 06:59
  • 1
  • No, that is `BashOperator`. With `SSHOperator` you can run bash commands on a remote server. I have updated my answer to include this info. – kaxil Nov 20 '18 at 09:10
  • The SSHOperator does not work in my case. The spark-submit job runs forever, but when I run the same job from the shell it finishes within 5 minutes. Kindly suggest. Link: https://stackoverflow.com/questions/56988228/running-spark-job-using-paramiko-library – satish silveri Jul 11 '19 at 11:45
  • @kaxil With approach #2 (SSHOperator), even though the task runs successfully, Airflow marks it as Failed. I even tested with `command='bash tmp.sh',` and that also comes back as Failed in Airflow. Any idea? – Sunny May 29 '21 at 01:09