
For a project I would like to run Spark via a webpage, with the goal of dynamically submitting job requests and getting status updates. As inspiration I used the following blog post on Spark's hidden REST API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api. After submitting the Spark request below, I send a REST request to check the status of the submission.

The request for a Spark job submission is the following:

curl -X POST http://sparkmasterIP:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "/home/opc/TestApp.jar"],
  "appResource" : "file:/home/opc/TestApp.jar",
  "clientSparkVersion" : "1.6.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.Test",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "TestJob",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://sparkmasterIP:6066"
  }
}'

Response:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20170302152313-0044",
  "serverSparkVersion" : "1.6.0",
  "submissionId" : "driver-20170302152313-0044",
  "success" : true
}

When asking for the submission status, I ran into some difficulties. To request the submission status I used the submissionId shown in the response above, so the following command was used:

curl http://masterIP:6066/v1/submissions/status/driver-20170302152313-0044

The Response for Submission Status contained the following error:

"message" : "Exception from the cluster:\njava.io.FileNotFoundException: /home/opc/TestApp.jar denied)\n\tjava.io.FileInputStream.open0(Native Method)\n\tjava.io.FileInputStream.open(FileInputStream.java:195)\n\tjava.io.FileInputStream.<init>(FileInputStream.java:138)\n\torg.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)\n\torg.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)\n\torg.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)\n\torg.spark-project.guava.io.Files.copy(Files.java:436)\n\torg.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:540)\n\torg.apache.spark.util.Utils$.copyFile(Utils.scala:511)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:596)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:395)\n\torg.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)",

My question is how to use this API in such a way that the submission status can be obtained. If there is another API through which the correct status can be obtained, then I would like a short description of how that API works in a RESTful way.

Thanks


1 Answer


As noted in the comments on the blog http://arturmkrtchyan.com/apache-spark-hidden-rest-api, several other commenters are experiencing this problem as well. Below I will try to explain some of the possible reasons.

It looks like your file:/home/opc/TestApp.jar is not found, or access to it is denied. A likely cause is that the jar is not present on all nodes while the job is submitted in cluster mode: the driver can be launched on any worker, and that worker then fails to read the local path. As the Spark documentation notes for the application jar: "Application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."
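One way to satisfy that requirement, following the documentation's hdfs:// suggestion, is to upload the jar to HDFS and point appResource at it. A minimal sketch, assuming an HDFS namenode reachable at namenodeIP:8020 (host, port and paths here are placeholders, not taken from the original setup):

# Upload the jar so it is globally visible inside the cluster
# (namenode address and target path are assumptions)
hdfs dfs -put /home/opc/TestApp.jar /user/opc/TestApp.jar

# Resubmit, pointing appResource at the HDFS copy
curl -X POST http://sparkmasterIP:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ ],
  "appResource" : "hdfs://namenodeIP:8020/user/opc/TestApp.jar",
  "clientSparkVersion" : "1.6.0",
  "environmentVariables" : { "SPARK_ENV_LOADED" : "1" },
  "mainClass" : "com.Test",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "TestJob",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://sparkmasterIP:6066"
  }
}'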

Apart from that, one recommendation I can give is to execute the command using spark-submit, which can also query a submission's status. More information about spark-submit can be found at Spark submit and in a book by Jacek Laskowski:

spark-submit --status [submission ID] --master [spark://...]
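Staying with the hidden REST API instead, the status endpoint from the question can simply be polled until the driver reaches a terminal state. A minimal shell sketch, assuming the master host and submission ID from above; the crude grep/cut parsing relies on the one-field-per-line JSON layout the API returns:

# Poll the status endpoint every 5 seconds until the driver finishes
SUBMISSION_ID=driver-20170302152313-0044
while true; do
  STATE=$(curl -s http://masterIP:6066/v1/submissions/status/$SUBMISSION_ID \
    | grep '"driverState"' | cut -d'"' -f4)
  echo "driverState: $STATE"
  case "$STATE" in
    FINISHED|FAILED|KILLED|ERROR) break ;;
  esac
  sleep 5
done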
  • Thanks. I tried a few steps. I did chmod 777 on the file but it still gave the same error. I also tried putting the file on all the cluster nodes, and putting the file on HDFS did not work either. Regarding your recommendation to execute using spark-submit: I am trying REST because I want to do it dynamically from the web page. If it works, I will call the REST request from the web page and start the job. – Saurabh Rana Mar 03 '17 at 06:17
  • If you want to use it dynamically from a web page, things change a little. One way to do it would be to use a jobserver. When you install the jobserver you can run RESTful commands against your interface. https://github.com/spark-jobserver/spark-jobserver . With the commands listed in [jobs](https://github.com/spark-jobserver/spark-jobserver#jobs) you can then get the status. – Paul Velthuis Mar 03 '17 at 08:17
  • Thanks. I will try spark-jobserver. Why is the Spark hidden REST API not suitable for this case (calling from the web)? – Saurabh Rana Mar 06 '17 at 06:36
  • The jobserver uses the Spark hidden REST, or better said, **RESTful** messages. spark-submit also emits the hidden REST calls under the hood, so both can be used dynamically from the web. It is just that the hidden REST API proposed by the author of the weblink in your question is a little more hacky: he turns the spark-submit messages into curl messages. – Paul Velthuis Mar 06 '17 at 08:07
  • What if I write Java code to call spark-submit with params and then expose this code as a RESTful web service which can be consumed by the code from the web page? – Saurabh Rana Mar 06 '17 at 09:59
  • @Saurav Stack Overflow isn't there for debugging your problems, please read the guides. When calling Java code you can send the output and the error messages to the web page. You can also program it using Java Servlets. I myself would still go for the jobserver. Hadoop is for big data, so keep in mind that it is not built for such a thing. What you want is a **web application**; read through the Java guides on how to do this, and I can recommend following courses. Remember that Spark is not built for immediately returning a result. – Paul Velthuis Mar 06 '17 at 13:22