
I have a question regarding Apache Spark running on YARN in cluster mode. According to this thread, Spark itself does not have to be installed on every (worker) node in the cluster. My problem is with the Spark executors: in general, YARN, or rather the ResourceManager, decides on resource allocation, so Spark executors can end up being launched on any (worker) node in the cluster. But then, how can Spark executors be launched by YARN if Spark is not installed on any (worker) node?
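
For concreteness, this is roughly the setup I mean: only the machine running `spark-submit` has Spark installed, and the job is launched with something like `spark-submit --master yarn --deploy-mode cluster --class SimpleApp simple-app.jar` (all names here are placeholders).

```scala
// Minimal placeholder application; nothing in the code itself is YARN-specific.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("simple-app"))

    // This work runs in executors somewhere on the YARN cluster's worker nodes.
    val evenCount = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()

    println(s"even numbers: $evenCount")
    sc.stop()
  }
}
```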

h4wX
  • Executors need to have the Spark runtime available somehow. That could be either by installing it on the nodes or by shipping it with your application, e.g. in a fat jar that bundles Spark. I think... – LiMuBei Dec 16 '16 at 09:57
  • You don't have to include the binaries in a fatjar/uberjar -- it's automatically delivered by spark-submit. – Jacek Laskowski Dec 16 '16 at 13:00

1 Answer


At a high level, when a Spark application is launched on YARN:

  1. An Application Master (Spark-specific) is created in one of the YARN containers.
  2. The other YARN containers are used for the Spark workers (executors); see the resource-request sketch after this list.
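
As a rough sketch of how those executor containers are requested (these are standard Spark-on-YARN settings; the values are only examples), the application describes the containers it wants and YARN's ResourceManager decides on which nodes they actually start:

```scala
import org.apache.spark.SparkConf

// Example resource request. YARN's ResourceManager picks the nodes on which
// the matching executor containers are launched.
val conf = new SparkConf()
  .setAppName("yarn-cluster-demo")
  .set("spark.executor.instances", "4") // number of executor containers
  .set("spark.executor.memory", "2g")   // memory per executor container
  .set("spark.executor.cores", "2")     // CPU cores per executor container
```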

The Spark driver then passes serialized actions (code) to the executors, which process the data.
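
As a minimal, self-contained sketch (nothing here is specific to any particular cluster), the lambda passed to `filter` below is exactly the kind of "serialized action" the driver ships out:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo"))

    // The lambda below, including the captured `threshold` value, is serialized
    // on the driver and sent to the executors, which run it against their data
    // partitions inside YARN containers.
    val threshold = 10
    val aboveThreshold = sc.parallelize(1 to 100).filter(_ > threshold).count()

    println(s"values above $threshold: $aboveThreshold")
    sc.stop()
  }
}
```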

spark-assembly provides the Spark-related jars needed to run Spark jobs on a YARN cluster, while the application ships its own functional (business-logic) jars.
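
One way to picture that split is in the build definition (a sketch only; the versions and the extra dependency are just illustrative): the Spark artifacts are marked `provided` because the cluster-side spark-assembly supplies them at runtime, while the application's own functional dependencies are shipped with the application jar.

```scala
// build.sbt (sbt build definitions are written in Scala)
name := "simple-app"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Supplied at runtime by spark-assembly / spark-submit on the cluster:
  "org.apache.spark" %% "spark-core" % "1.6.3" % "provided",
  // An application-level ("functional") dependency bundled with the app jar:
  "joda-time" % "joda-time" % "2.9.4"
)
```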


Edit: (2017-01-04)

Spark 2.0 no longer requires a fat assembly jar for production deployment. (source)

mrsrinivas
  • Thanks for your answer. I do know about the YARN containers. As far as I have understood it, an executor (which is actually a process) is launched "in" a container. But still, the executor itself is a Spark-specific component. Thus my question: how can Spark executors be launched (within a YARN container) if Spark is not installed on any worker node? – h4wX Dec 16 '16 at 09:54
  • Yes! An executor is a process, and all the computing logic is passed to it by the Spark driver program. Other jars like **spark-assembly** are made available to all workers by moving them to HDFS when the application is launched (it's an automated process). – mrsrinivas Dec 16 '16 at 12:04
  • Yes, that's right, but how does this work in the specific case? – h4wX Dec 16 '16 at 12:09
  • Does this mean that specifically the spark-assembly jar (which contains all relevant dependencies for the application) is required in order to launch Spark Executors if Spark is not installed on the worker nodes? – h4wX Dec 16 '16 at 12:37
  • True. spark-assembly provides the Spark-related jars to run Spark jobs on a YARN cluster, and the application will have its own functional jars. – mrsrinivas Dec 16 '16 at 12:44