
I'm submitting a Spark job from a shell script that passes a bunch of env vars and parameters to spark-submit. Strangely, the driver host is not one of these parameters (there are parameters for driver cores and memory, however). So if I have 3 machines in the cluster, the driver host is chosen seemingly at random. I don't want this behaviour, because 1) the jar I'm submitting exists on only one of the machines, and 2) the driver machine is often meant to be smaller than the other machines, which is not the case if the choice is random.

So far I have found no way to specify this parameter on the spark-submit command line. I've tried `--conf SPARK_DRIVER_HOST="172.30.1.123"`, `--conf spark.driver.host="172.30.1.123"` and many other things, but nothing has any effect. I'm using Spark 2.1.0. Thanks.
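
For reference, here is a minimal sketch of the kind of submit script described above; the host, paths, class name, and resource values are placeholders, not the real ones:

    #!/bin/bash
    # Hypothetical submit script; paths, class name, and values are placeholders.
    SPARK_HOME=/opt/spark            # assumed install location
    MASTER_URL=yarn                  # or e.g. spark://172.30.1.121:7077
    DRIVER_HOST=172.30.1.123         # the machine the driver should ideally run on

    # spark.driver.host is set here, but it does not control which machine the
    # driver is placed on -- which is exactly the problem described above.
    "$SPARK_HOME"/bin/spark-submit \
      --master "$MASTER_URL" \
      --driver-memory 2g \
      --driver-cores 2 \
      --executor-memory 4g \
      --conf spark.driver.host="$DRIVER_HOST" \
      --class com.example.MyJob \
      /path/to/my-job.jar "$@"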

Dmitri

2 Answers

I assume you are running on a YARN cluster. In brief, YARN uses containers to launch and run tasks, and the ResourceManager decides where to run which container based on the availability of resources. In Spark's case, drivers and executors are also launched as containers with separate JVMs. The driver is dedicated to splitting the work into tasks among the executors and collecting the results from them. If the node from which you launch your application is part of the cluster, it will also be used as a shared resource for launching the driver/executor.
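
To make the container model concrete, here is a hedged sketch of the two YARN deploy modes (the class name, jar path, and resource values are placeholders):

    # Cluster mode: the ResourceManager picks the node for the ApplicationMaster
    # container, and the driver runs inside it -- so driver placement is up to YARN.
    spark-submit --master yarn --deploy-mode cluster \
      --driver-memory 2g --driver-cores 2 \
      --executor-memory 4g --executor-cores 2 \
      --class com.example.MyJob /path/to/my-job.jar

    # Client mode: the driver runs in the spark-submit JVM on the launching machine;
    # only the executors are placed by the ResourceManager.
    spark-submit --master yarn --deploy-mode client \
      --driver-memory 2g --driver-cores 2 \
      --executor-memory 4g --executor-cores 2 \
      --class com.example.MyJob /path/to/my-job.jar

If no memory flags are given, both spark.driver.memory and spark.executor.memory default to 1g, which matches the OutOfMemoryError mentioned in the comment below.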

FaigB
  • Okay, it seems I don't have a clear understanding of the execution model. If executors are launched automatically, do I need to specify `--executor-memory` and `--executor-cores` (and the same for the driver)? Currently, if I don't specify memory, it seems only 1 GB is used (although much more is available) and I get an `OutOfMemoryError`. – Dmitri Mar 04 '17 at 15:37
  • My suggestion is for you to read the book Spark in Action. All the points relevant to your case are described there. – FaigB Mar 04 '17 at 19:26

From the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html

When running the cluster in standalone mode or on Mesos, the driver host (this is the master) is specified with:

--master <master-url> #e.g. spark://23.195.26.187:7077

When using YARN it works a little differently. Here the parameter is simply yarn:

--master yarn

The ResourceManager that yarn refers to is specified in Hadoop's own configuration. For how to set this up, see this interesting guide: https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/ . Basically, HDFS is configured in hdfs-site.xml and YARN in yarn-site.xml.
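
As a concrete (hedged) illustration of that last point, spark-submit finds the ResourceManager through the Hadoop client configuration rather than through its own command line; the paths and class name below are examples:

    # Point Spark at the Hadoop/YARN client configuration (example path).
    export HADOOP_CONF_DIR=/etc/hadoop/conf    # contains yarn-site.xml and hdfs-site.xml

    # With --master yarn, the ResourceManager address is read from yarn-site.xml
    # (yarn.resourcemanager.hostname / yarn.resourcemanager.address), not passed
    # on the spark-submit command line.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyJob /path/to/my-job.jar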

Paul Velthuis
  • I am indeed specifying the master with the `--master` parameter; the driver is, however, still chosen randomly. – Dmitri Mar 04 '17 at 15:35
  • For the out-of-memory error it is useful to look at the following Stack Overflow question: http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space. With the option `spark.executor.memory` the amount of memory can be specified when running on a cluster. For further memory information read http://spark.apache.org/docs/latest/configuration.html . When you use Spark in local mode, Spark spawns all the execution components - driver, executor, backend, and master - in the same JVM; it is the only mode in which the driver is also used for execution. – Paul Velthuis Mar 04 '17 at 17:28
  • One solution to your memory problem could be to use dynamic allocation, which allows for finer-grained control; it is described in more detail in this blog post: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ . When starting a standalone cluster you can specify in conf/spark-env.sh where the master is started; this will also be where your driver is. See http://spark.apache.org/docs/latest/spark-standalone.html. – Paul Velthuis Mar 04 '17 at 17:47
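
A short sketch of those last two suggestions, assuming a standalone cluster and the standard Spark 2.x property and variable names (host, port, and values are examples):

    # conf/spark-env.sh on the master machine (standalone mode):
    # fixes the host the standalone master is started on.
    export SPARK_MASTER_HOST=172.30.1.123

    # Dynamic allocation: let Spark grow and shrink the number of executors at
    # runtime instead of sizing them up front (needs the external shuffle service).
    spark-submit --master spark://172.30.1.123:7077 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --class com.example.MyJob /path/to/my-job.jar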