
I have a Spark standalone cluster running on a few machines. All workers are using 2 cores and 4GB of memory. I can start a job server with ./server_start.sh --master spark://ip:7077 --deploy-mode cluster --conf spark.driver.cores=2 --conf spark.driver.memory=4g, but whenever I try to start a server with more than 2 cores, the driver's state gets stuck at "SUBMITTED" and no worker takes the job.

I tried starting the spark-shell on 4 cores with ./spark-shell --master spark://ip:7077 --conf spark.driver.cores=4 --conf spark.driver.memory=4g and the job gets shared between 2 workers (2 cores each). The spark-shell gets launched as an application and not a driver though.

Is there any way to run a driver split between multiple workers? Or can I run the job server as an application rather than a driver?

yberg
  • In the first example, you are deploying JobServer in cluster mode, which means it will use 2 CPU cores of a Worker. If the context you create in JobServer asks for more than the remaining available cores, that would explain your problem. This doesn't happen with spark-shell because it always runs in client mode, so it will use 4 cores on the local machine. – Daniel de Paula May 16 '16 at 20:21
  • This is before I create a context though. I don't create a context in the job server that I deploy; I create it by calling http://ip:8090/contexts once the server is running. – yberg May 16 '16 at 20:25
  • Ok, I understand the problem now. Each worker has only 2 cores, right? When you use `--deploy-mode cluster`, JobServer runs on a single worker and cannot be split, so the number of cores allocated to it must be no more than the number of cores available on a single worker. Alternatively, you can run JobServer in client mode (`--deploy-mode client`), so it will use the cores available on the machine where you start the service (a sketch of such a client-mode launch follows these comments). – Daniel de Paula May 16 '16 at 20:27
  • Difference between client and cluster mode: http://stackoverflow.com/questions/37027732/ – Daniel de Paula May 16 '16 at 20:30
  • Alright, both ways are working. But is there no way to run the driver distributed over multiple workers (and not just cores), like I can with an application, e.g. the spark-shell? – yberg May 16 '16 at 20:36
  • The spark-shell is always run in client mode (note that you cannot set it to cluster mode), so it actually runs only in one machine: the one that started it (it doesn't even need to be part of the cluster, and usually isn't). – Daniel de Paula May 16 '16 at 20:39
  • You may think that spark-shell is being split among workers because it automatically creates a context (which Spark sees as an "application"). This context is then allocated the number of cores it asks for (you can change this by setting the configuration `spark.executor.cores`). – Daniel de Paula May 16 '16 at 20:42
  • http://imgur.com/MN9jyIb The two workers are run from two entirely different machines. Doesn't this mean that the application runs distributed across two workers? None of the machines have 8 cores. – yberg May 16 '16 at 20:59
  • The **context** created by spark-shell has been allocated two executors, but the spark-shell itself (the driver) is running on the machine where you started it. With JobServer, you can run it in client mode; then, when you create a context, the jobs sent to JobServer on that context will be executed by multiple executors. – Daniel de Paula May 16 '16 at 21:08
  • Maybe reading [this page](http://spark.apache.org/docs/latest/cluster-overview.html) and [this page](http://spark.apache.org/docs/latest/spark-standalone.html) can help you. – Daniel de Paula May 16 '16 at 21:12
  • Alright, thank you very much for your help. I guess I can't get a driver to run across multiple workers then. I just don't see the point of having a cluster if I can't have workers share the workload of a job server. – yberg May 16 '16 at 21:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112089/discussion-between-daniel-de-paula-and-yberg). – Daniel de Paula May 16 '16 at 21:36
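For reference, a client-mode launch of JobServer as suggested in the comments might look roughly like this. This is only a sketch: the flags are passed through to spark-submit (the question shows the same flags being accepted in cluster mode), and ip is the same placeholder used in the question.

./server_start.sh --master spark://ip:7077 --deploy-mode client --conf spark.driver.cores=4 --conf spark.driver.memory=4g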

1 Answer


The problem was resolved in the chat.

You have to change your JobServer .conf file to set the master parameter to point to your cluster:

master = "spark://ip:7077"
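For orientation, in the local.conf template that ships with spark-jobserver this setting sits inside the top-level spark block. The surrounding keys below are only a sketch of that layout and may differ in your copy of the file:

spark {
  # Point JobServer at the standalone master instead of a local[...] master
  master = "spark://ip:7077"

  jobserver {
    port = 8090
  }
}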

Also, the memory that the JobServer process uses can be set in the settings.sh file.
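For example, in the settings.sh template the driver memory is controlled by a single variable; the variable name below is taken from that template and may vary between JobServer versions:

JOBSERVER_MEMORY=4G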

After setting these parameters, you can start JobServer with a simple call:

./server_start.sh

Then, once the service is running, you can create your context via REST, which will ask the cluster for resources and will receive an appropriate number of executors/cores:

curl -d "" '[hostname]:8090/contexts/cassandra-context?context-factory=spark.jobserver.context.CassandraContextFactory&num-cpu-cores=8&memory-per-node=2g'

Finally, every job sent via POST to JobServer on this context can use the executors allocated to it and run in a distributed way.
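As an illustration, a job could then be submitted to that context with a call along the following lines. Here appName refers to a jar previously uploaded to JobServer, and spark.jobserver.WordCountExample is just the sample job class from the JobServer test jar; substitute your own jar name and job class:

curl -d "input.string = a b c a b" '[hostname]:8090/jobs?appName=my-app&classPath=spark.jobserver.WordCountExample&context=cassandra-context&sync=true'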

Daniel de Paula