2

Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?

In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule our jobs.

Thanks.

Bill Jash

2 Answers

3

Yup, it's possible. The procedure is the same as for other Oozie jobs: you have to provide Oozie a directory structure containing coordinator.xml, workflow.xml, and a lib directory with your JAR files.
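For reference, a minimal application directory might look like this before it is uploaded to HDFS (the layout is the standard Oozie one; the names are illustrative):

```
oozie-app/
├── coordinator.xml    # schedule definition (only needed for scheduled jobs)
├── workflow.xml       # workflow definition
└── lib/
    └── App.jar        # your application JAR plus any dependency JARs
```

You would then upload the directory and submit the job, with job.properties pointing `oozie.wf.application.path` (or `oozie.coord.application.path` for coordinators) at the HDFS copy:

```
$ hdfs dfs -put oozie-app /user/$USER/oozie-app
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```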
But remember that Oozie starts the job with a plain `java -cp` command, not with `spark-submit`, so if you have to run it through Oozie, here is a trick. Run your JAR with `spark-submit` in the background, then look for that process in the process list. It will be running under a `java -cp` command, but with some additional JARs that were added by `spark-submit`. Add those JARs to the CLASSPATH, and that's it. Now you can run your Spark applications through Oozie:

1. `nohup spark-submit --class package.to.MainClass /path/to/App.jar &`
2. `ps aux | grep '/path/to/App.jar'`
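To tie this together, a minimal workflow.xml built around Oozie's standard java action might look like the sketch below; the extra JARs found via `ps` would go into the application's `lib/` directory, which Oozie puts on the classpath automatically (the workflow name and class are illustrative):

```xml
<workflow-app name="spark-java-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="spark-java-node"/>
    <action name="spark-java-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Your Spark application's entry point, launched via java -cp -->
            <main-class>package.to.MainClass</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```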

EDIT: You can also use the latest Oozie, which has a dedicated Spark action.
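A sketch of what such an action can look like, assuming the `spark-action:0.1` schema (the app name, class, paths, and options are illustrative):

```xml
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Run on YARN in cluster mode -->
        <master>yarn-cluster</master>
        <name>MySparkApp</name>
        <class>package.to.MainClass</class>
        <jar>${nameNode}/path/to/App.jar</jar>
        <spark-opts>--executor-memory 2G --num-executors 2</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```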

Zia Kiyani
  • Could you please give an example of how you triggered this from the Oozie workflow? What I tried was using the command in Oozie, and then in the shell script itself I ran `nohup spark-submit --class ... /path/to/app.jar &`, but that didn't seem to work. It seems Oozie did nothing but just quit, so no Spark job was submitted. What I am trying to do is to let Oozie submit the Spark job and then quit (mark the job as success & completed), because it consumes quite a lot of resources otherwise (2 cores & 2G of RAM as a minimum; I can't find a way to make it go lower). Thanks a lot! – RHE Aug 15 '16 at 16:41
  • 1
    I didn't get what you are actually trying to do; can you please elaborate? – Zia Kiyani Aug 15 '16 at 17:09
  • Hi Zia, thanks for the reply. When you run a Spark job using Oozie, let's say the Spark job takes 20 minutes to finish; usually the Oozie job will finish after the Spark job finishes, in other words after 20 minutes. What I would like to do is to finish the Oozie process early (i.e. by running the Spark job in the background using nohup or disown) immediately after `spark-submit` is run. You probably don't want to do this for a normal Spark job, but for Spark Streaming it kind of makes sense, because Spark Streaming jobs run 24/7 non-stop. Maybe I shouldn't use Oozie for Spark Streaming... – RHE Aug 15 '16 at 22:10
  • Ohh, I get it now. Yes, for Spark Streaming, why would you want to use Oozie? Spark Streaming runs continuously, while Oozie is used where we have to schedule jobs at intervals. Anyway, if you still want this, then the best option is to run the command from your code. But for that, you have to run the `` command in a daemon thread, so that your command can run after the program terminates. – Zia Kiyani Aug 16 '16 at 09:05
  • What is Daemon thread in Java? http://stackoverflow.com/questions/2213340/what-is-daemon-thread-in-java – Zia Kiyani Aug 16 '16 at 09:08
  • Here is a code example to run a daemon thread: http://stackoverflow.com/questions/30706704/java-run-async-processes Hopefully it will help (a rough sketch of this approach also appears after these comments). – Zia Kiyani Aug 16 '16 at 09:09
  • Yup, I talked myself out of it, I guess :) Using Oozie is not right for Spark Streaming. The reason we used it was that we had used Oozie to submit other Spark jobs initially, and then just used the same thing for Spark Streaming. Just out of interest, what do you use to deploy and run Spark Streaming jobs? Thanks a lot again Zia, and I will read into daemon threads too. – RHE Aug 17 '16 at 22:29
  • 1
    Good. Yes, this is the perfect approach: schedule batch jobs through Oozie and run streaming jobs with `spark-submit` directly. I also use the same technique. – Zia Kiyani Aug 18 '16 at 09:24
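For completeness, a minimal sketch of the daemon-thread launch discussed in the comments above, assuming the goal is to fire off `spark-submit` as a separate OS process and let the driver JVM exit (the class name and paths are illustrative, not taken from the linked answers):

```java
import java.io.IOException;

public class SubmitAndQuit {
    public static void main(String[] args) throws InterruptedException {
        // Launch spark-submit from a daemon thread so it never blocks JVM shutdown.
        Thread launcher = new Thread(() -> {
            try {
                // start() returns immediately; the spawned OS process
                // keeps running even after this JVM exits.
                new ProcessBuilder(
                        "spark-submit",
                        "--class", "package.to.MainClass", // hypothetical class
                        "/path/to/App.jar")                // hypothetical path
                    .inheritIO()
                    .start();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        launcher.setDaemon(true);
        launcher.start();
        // Wait just long enough for the process to actually be spawned;
        // a daemon thread would otherwise be killed if main exited first.
        launcher.join();
    }
}
```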
0

To run Spark SQL via Oozie you need to use the Oozie Spark action. You can locate the Oozie examples archive on your distribution; in Cloudera (CDH) it is usually found at the path below:

```
$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
```
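If you want to start from those examples, you can extract the archive and look at the sample workflow applications (the target directory here is illustrative, and the exact contents vary by version):

```
$ tar -xzf /usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz -C /tmp
$ ls /tmp/examples/apps
```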

Spark SQL needs the hive-site.xml file for execution, which you need to provide in workflow.xml:

`<spark-opts>--files /hive-site.xml</spark-opts>`
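In context, that option sits inside the Spark action alongside the class and jar; a fragment, mirroring the action sketch in the first answer (the class name and HDFS path are illustrative):

```xml
<class>package.to.SparkSqlMain</class>
<jar>${nameNode}/path/to/App.jar</jar>
<!-- Ship hive-site.xml so Spark SQL can reach the Hive metastore -->
<spark-opts>--files ${nameNode}/path/to/hive-site.xml</spark-opts>
```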

Arvind Kumar