
Some run-time clarifications are requested.

In a thread I read elsewhere, it was stated that a Spark Executor should only ever have a single Core allocated. However, I wonder whether this is really always true. Reading the various SO questions, as well as the likes of Karau, Wendell et al., it is clear that there are equal and opposite experts who state that in some cases one should specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.

  • My understanding is that a Partition of an RDD, DF or DS is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can a Partition benefit from multiple Cores?

    • If I have a map followed by, say, a filter, these are not two Tasks that can be interleaved - as in what Informatica does - since my understanding is that they are fused together into a single Task. That being so, what is an example of a benefit from an assigned Executor running more Cores? (See the pipeline sketch below this list.)

    • From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does this mean, functionally speaking, in practice?

  • Moreover, can an Executor be allocated if not all of its Cores can be acquired? I presume there is a wait period, and that after a while it may be allocated in a more limited capacity. Is that true?

  • From a highly rated answer on SO, What is a task in Spark? How does the Spark worker execute the jar file?, the following is stated: "When you create the SparkContext, each worker starts an executor." From another SO question: "When a SparkContext is created, each worker node starts an executor."

    Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
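For concreteness, here is a minimal sketch of the kind of pipeline I have in mind (the RDD and its contents are made up purely for illustration):

    // Hypothetical pipeline: the map and the filter below are fused into the
    // same stage, so each partition is processed by a single task.
    // Where do extra Cores per Executor come into play here?
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)  // 8 partitions

    val result = rdd
      .map(_ * 2)          // narrow transformation
      .filter(_ % 3 == 0)  // narrow transformation, fused with the map above
      .count()             // action: 8 tasks, one per partition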

I ask this because even the excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow that post clearly, and it fits with my understanding of 1 Core per Executor.

thebluephantom

1 Answer


My understanding is that a Partition (...) serviced by a single Executor.

That's correct; however, the opposite is not true - a single executor can handle multiple partitions / tasks, across multiple stages or even multiple RDDs.

then what is an example of a benefit from an assigned Executor running more Cores?

First and foremost, processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavyweight process, it might be preferable to keep only one instance for a number of threads. Additionally, it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example to store broadcast variables).
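As a very rough sketch (the numbers are arbitrary placeholders, not a recommendation), this trade-off is just configuration - one executor JVM with spark.executor.cores = 4 can run up to 4 tasks concurrently, all sharing the same broadcast variables:

    import org.apache.spark.sql.SparkSession

    // Equivalent spark-submit flags: --num-executors 3 --executor-cores 4
    val spark = SparkSession.builder()
      .appName("multi-core-executor-sketch")
      .config("spark.executor.instances", "3")  // 3 executor JVMs
      .config("spark.executor.cores", "4")      // up to 4 concurrent tasks per JVM
      .getOrCreate()

    // A broadcast variable is materialised once per executor, not once per task,
    // so the 4 tasks running in the same JVM read the same copy.
    val lookup = spark.sparkContext.broadcast(Map(1 -> "a", 2 -> "b"))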

A secondary application is applying multiple threads to a single partition when the user invokes multi-threaded code. That is, however, not something that is done by default (Number of CPUs per Task in Spark).
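A sketch of that secondary case (process is a made-up stand-in for some CPU-heavy function, and spark.task.cpus = 2 is assumed to have been set so the scheduler reserves two cores per task):

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    def process(x: Int): Int = x * x  // stand-in for a CPU-heavy computation

    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    val out = rdd.mapPartitions { iter =>
      // User-driven parallelism inside a single partition / single task;
      // spark.task.cpus tells the scheduler to account for these extra threads.
      implicit val ec: ExecutionContext = ExecutionContext.global
      val futures = iter.map(x => Future(process(x))).toList
      Await.result(Future.sequence(futures), Duration.Inf).iterator
    }.collect()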

See also What are the benefits of running multiple Spark tasks in the same JVM?

If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?

Pretty much by extension of the points made above - executors are not created to handle a specific task / partition. They are long-running processes and, as long as dynamic allocation is not enabled, they are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as the already mentioned dynamic allocation, can affect that, but that's the basic model).
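For completeness, a sketch of the dynamic allocation knobs mentioned above (the property names are real, the values are placeholders):

    import org.apache.spark.sql.SparkSession

    // With dynamic allocation the executor count grows and shrinks with the
    // workload instead of staying fixed for the whole application.
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .config("spark.shuffle.service.enabled", "true")  // external shuffle service, typically required alongside dynamic allocation
      .getOrCreate()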

user10938362
    2 executors = 2 JVMs, 1 executor = 1 JVM, and given JVM overhead and Spark bookkeeping utilities that is easily 1 GB of RAM for an idle executor JVM. – user10938362 Jan 24 '19 at 22:47