
I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.

For the Spark jobs specifically, I want to use Apache Livy, but I'm not sure whether that is a good idea or whether I should run spark-submit instead.

What is the best way to track a Spark job from Airflow once I have submitted it?

Ramdev Sharma

1 Answer


My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the alternatives (a minimal submission sketch follows the list below):

  • Specifying remote master IP: Requires modifying global configurations / environment variables
  • Using SSHOperator: SSH connection might break
  • Using EmrAddStepsOperator: Dependent on EMR
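For illustration, here's a minimal sketch of a remote submission through Livy's REST batch API, e.g. from inside an Airflow PythonOperator callable. The Livy URL, JAR path, class name, and arguments are all hypothetical placeholders:

    import requests

    LIVY_URL = "http://livy-server:8998"  # placeholder Livy endpoint

    # Describe the application JAR to run; all values here are example values
    payload = {
        "file": "hdfs:///jobs/my-spark-app.jar",  # JAR on a cluster-visible path
        "className": "com.example.SparkApp",      # main class of your application
        "args": ["--run-date", "2019-01-17"],
        "conf": {"spark.executor.memory": "4g"},
    }

    resp = requests.post(f"{LIVY_URL}/batches", json=payload)
    resp.raise_for_status()
    batch_id = resp.json()["id"]  # keep this id to poll state / logs later
    print(f"Submitted Livy batch {batch_id}")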

Regarding tracking

  • Livy only reports state, not progress (% completion of stages)
  • If you're OK with that, you can just poll the Livy server via its REST API and keep printing the logs to the console; those will appear in the task logs in the web UI (View Logs). A polling sketch follows this list.
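As a sketch of that polling loop (assuming the same hypothetical Livy endpoint as above), run this inside a PythonOperator callable so the printed log lines end up in the Airflow task log:

    import time
    import requests

    LIVY_URL = "http://livy-server:8998"  # placeholder Livy endpoint

    def track_batch(batch_id, poll_interval=30):
        """Poll Livy until the batch reaches a terminal state, echoing driver logs."""
        offset = 0
        while True:
            state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]

            # Fetch log lines we haven't printed yet; print() output lands in
            # the Airflow task log when this runs inside a PythonOperator
            logs = requests.get(
                f"{LIVY_URL}/batches/{batch_id}/log",
                params={"from": offset, "size": 100},
            ).json()
            for line in logs.get("log", []):
                print(line)
            offset = logs.get("from", offset) + len(logs.get("log", []))

            if state in ("success", "dead", "killed"):
                return state
            time.sleep(poll_interval)

You would typically fail the Airflow task when the returned state is dead or killed, so that retries and alerting behave as usual.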

Other considerations

  • Livy doesn't support reusing a SparkSession across POST /batches requests
  • If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests (sketched below)
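A sketch of that session-based flow under the same assumptions: the interactive session keeps one SparkSession alive across statements, which the batch API does not:

    import time
    import requests

    LIVY_URL = "http://livy-server:8998"  # placeholder Livy endpoint

    # Create an interactive PySpark session; unlike a batch, the SparkSession
    # it holds survives across statements
    session_id = requests.post(
        f"{LIVY_URL}/sessions", json={"kind": "pyspark"}
    ).json()["id"]

    # Wait until the session is ready to accept code
    while requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"] != "idle":
        time.sleep(5)

    # Each POST /sessions/{id}/statements runs against the same SparkSession;
    # `spark` is pre-bound inside the session
    stmt = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json={"code": "df = spark.range(100)\ndf.count()"},
    ).json()
    print(f"Submitted statement {stmt['id']} to session {session_id}")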


y2k-shubham
  • Thank you @y2k-shubham. My source code is in Scala and I want to run it as an application using a JAR. It looks like Livy is the better option, and I will submit using batches with the POST method. Tracking progress is problematic, but I think tracking status should be good enough for now. – Ramdev Sharma Jan 17 '19 at 14:37