Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule jobs.
Thanks.
Yes, it's possible. The procedure is the same as for any other Oozie job: you provide Oozie with a directory containing a coordinator.xml, a workflow.xml, and a lib directory holding your jar files (a sample layout is sketched below).
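As an illustration only, the application directory you upload to HDFS might look like this (the directory and jar names are hypothetical placeholders, not taken from the answer):

oozie-app/            (uploaded to HDFS, e.g. hdfs:///user/me/oozie-app)
  coordinator.xml     (schedule/trigger definition; optional for a one-off workflow)
  workflow.xml        (the workflow definition Oozie executes)
  lib/
    App.jar           (your Spark application jar)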
But remember that Oozie starts the job with a java -cp command, not with spark-submit, so if you have to run it through Oozie, here is a trick: run your jar with spark-submit in the background, then look for that process in the process list. It will be running under a java -cp command, but with some additional jars that were added by spark-submit. Add those jars to the CLASS_PATH, and that's it. Now you can run your Spark applications through Oozie.
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
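If it helps, on Linux you can also dump the full command line of the process found in step 2 to see exactly which jars spark-submit added; here <pid> is a placeholder for the process id from the ps output, not something given in the original steps:

3. cat /proc/<pid>/cmdline | tr '\0' ' '    # print the full "java -cp ..." command, including the jars spark-submit added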
EDIT: You can also use the latest Oozie, which has a Spark action as well.
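For reference, a minimal workflow.xml using the Spark action might look roughly like this. The workflow name, class, and jar path are made-up placeholders, and the schema versions may differ depending on your Oozie release:

<workflow-app name="spark-sql-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>SparkSqlJob</name>
            <class>package.to.MainClass</class>
            <jar>${nameNode}/path/to/App.jar</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>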
To run Spark SQL via Oozie you need to use the Oozie Spark action. You can locate the Oozie examples archive on your distribution; on Cloudera it is usually found at the path below.

]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL needs the hive-site.xml file for execution, which you need to provide in workflow.xml:

<spark-opts>--files /hive-site.xml</spark-opts>
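For context, that element goes inside the <spark> action of the workflow, roughly like this; the class and jar path are placeholders, and this assumes hive-site.xml is reachable at the path given above:

            <class>package.to.MainClass</class>
            <jar>${nameNode}/path/to/App.jar</jar>
            <spark-opts>--files /hive-site.xml</spark-opts>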