10

Different sources (e.g. 1 and 2) claim that Spark can benefit from running multiple tasks in the same JVM. But they don't explain why.

What are these benefits?

Andronicus
  • 25,419
  • 17
  • 47
  • 88
Marek Grzenkowicz
  • 17,024
  • 9
  • 81
  • 111

2 Answers2

6

As it was already said broadcast variables is one thing.

Another are problems with concurrency. Take a look at this of code:

var counter = 0
var rdd = sc.parallelize(data)

rdd.foreach(x => counter += x)

println(counter)

The result may be different depending, whether executed locally or on a Spark deployed on clusters (with different JVM). In the latter case the parallelize method splits the computation between the executors. The closure (environment needed for every node to do its task) is computed, which means, that every executor receives a copy of counter. Each executor sees its own copy of the variable, thus the result of the calculation is 0, as none of the executor referenced the right object. Within one JVM on the other hand counter is visible to every worker.

Of course there is a way to avoid that - using Acumulators (see here).

Last but not least when persisting RDDs in memory (default cache method storage level is MEMORY_ONLY), it will be visible within single JVM. This can also be overcome by using OFF_HEAP (this is experimental in 2.4.0). More here.

Andronicus
  • 25,419
  • 17
  • 47
  • 88
5

The biggest possible advantage is shared memory, in particular handling broadcasted objects. Because these objects are considered read-only there can be shared between multiple threads.

In scenario when you use a single task / executor you need a copy for each JVM so with N tasks there is N copies. With large objects this can be a serious overhead.

Same logic can be applied to other shared objects.

Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115
  • I saw broadcast variables mentioned in this context [here](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/). However, I find it hard to believe that it's only about broadcast variables. – Marek Grzenkowicz Dec 07 '17 at 11:42
  • 1
    Can the threads running in the same JVM share cached data (partitions)? This could allow to optimize shuffling, because partitions processed by a single executor could be shuffled in memory, without writing them to disk. But this is only a speculation on my side, I don't know if this is possible. – Marek Grzenkowicz Dec 07 '17 at 11:43