Thanks to the clue by @Pandey. The answer https://stackoverflow.com/a/37980813/5634636 helps me a lot.
TL;DR
Detailed description
NOTE: I only test my approaches on Apache Spark 2.3.1. I can't guarantee that it will work in other versions as well.
Let's clear my requirement first. There're 3 features I wanted:
- remote submit a spark job
- check job status anytime (RUNNING, ERROR, FINISHED...)
- get the error message if there is something error
Submit locally
NOTE: this answer only works in cluster mode
The Spark tool spark-submit
will help.
Submit remotely
Spark submission API is recommended. It seems that there is not any documentation on Apache Spark official website, so some people call it hidden API. For details, see: https://www.nitendragautam.com/spark/submit-apache-spark-job-with-rest-api/
- To submit Spark job, use submit API
- To get the status of the job, use status API:
http://<master-ip>:6066/v1/submissions/status/<submission-id>
. The submission-id
will be returned in a json when you submit jobs.
- The error message is included in the status message.
- More about the error message: note the difference between status ERROR and FAILED. In short, the FAILED means that there is something wrong during executing Spark jobs (e.g. uncaught exceptions), while the ERROR means there's something error during submitting (e.g. invalid jar path). The error message is included in the status json. If you want to view the FAILED reason, it can be accessed via
http://<driver-ip>:<ui-port>/log/<submission-id>
.
Here is an example of error status (**** is an incorrect jar path which is miswritten intentionally):
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: File hdfs:**** does not exist.\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)\n\torg.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)\n\torg.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\torg.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860)\n\torg.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:727)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:695)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:488)\n\torg.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)\n\torg.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)",
"serverSparkVersion" : "2.3.1",
"submissionId" : "driver-20190315160943-0005",
"success" : true,
"workerHostPort" : "172.18.0.4:36962",
"workerId" : "worker-20190306214522-172.18.0.4-36962"
}