Could you please tell me in which cases I should use the rdd.cache() and sc.broadcast() methods? (Broadcast is a SparkContext method, not an RDD method.)
Let's take an example -- suppose you have employee_salary data that contains the department and salary of every employee. Now say the task is to find, for each employee, their salary as a fraction of their department's average salary. (If employee e1 is in department d1, we need to compute e1.salary / average(all salaries in d1).)
Now one way to do this is -- you first read the data into an RDD -- say rdd1 -- and then do two things, one after the other.
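For concreteness, here is a minimal sketch of the read step in Scala, assuming a hypothetical CSV-style input of id,dept,salary lines (the Employee case class, the field order, and the file path are all illustrative, not from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record type -- adjust to your actual schema.
case class Employee(id: String, dept: String, salary: Double)

val sc = new SparkContext(new SparkConf().setAppName("salary-fractions"))

// Parse each "id,dept,salary" line into an Employee record.
// (No handling of malformed lines in this sketch.)
val rdd1 = sc.textFile("hdfs:///data/employee_salary.csv").map { line =>
  val Array(id, dept, salary) = line.split(",")
  Employee(id, dept, salary.toDouble)
}
```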
First, calculate the department-wise salary average using rdd1. You will eventually have the department average salaries result -- basically a map of deptId to average salary -- on the driver.
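One way to compute that map, continuing the sketch above -- a classic sum-and-count average, where collectAsMap() is the standard pair-RDD action for pulling a small keyed result back to the driver:

```scala
// Sum and count salaries per department in one pass, then divide.
// collectAsMap() brings the small per-department result to the driver.
val deptAvg: Map[String, Double] = rdd1
  .map(e => (e.dept, (e.salary, 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collectAsMap()
  .toMap
```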
Second, you will need to use this result to divide each employee's salary by their department's average. Remember that any worker can hold employees from any department, so the department-wise averages must be available on every worker. How to do this? You can send the averages map you collected on the driver to each worker as a broadcast variable*, and it can then be used to calculate the salary fraction for every "row" in rdd1.
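Continuing the sketch, the broadcast step might look like this (sc.broadcast() and .value are the actual Spark API; the variable names are mine):

```scala
// Create the broadcast variable once on the driver. Spark ships it to each
// executor a single time; every task on that executor reads it via .value.
val avgBc = sc.broadcast(deptAvg)

// Each employee's salary as a fraction of their department's average.
val fractions = rdd1.map(e => (e.id, e.salary / avgBc.value(e.dept)))
```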
What about caching an RDD? Remember that from the initial rdd1 there are two branches of computation -- one for calculating the department-wise averages and another for applying those averages to each employee in the RDD. Now, if you do not cache rdd1, then for the second task Spark will go back to the source and recompute it (re-reading and re-parsing the input), because Spark does not keep intermediate RDDs around by default. But since we know we will use the same RDD twice, we can ask Spark to keep it in memory the first time it is computed. Then the next time we apply transformations to it, it is already in memory.
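In the sketch above, that is a single extra call. Note that cache() is lazy, so what matters is placing it before the first action:

```scala
rdd1.cache() // lazy: nothing is stored until the first action runs

// Pass 1 (the reduceByKey ... collectAsMap job above) is the first action,
// so it computes rdd1 once and stores its partitions in executor memory.
// Pass 2 (the broadcast-based map over rdd1) then reuses those cached
// partitions instead of re-reading and re-parsing the input file.
```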
*We could use department-based partitioning and a join to avoid the broadcast, but for the purpose of illustration, let's say we do not.
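For completeness, a sketch of that alternative: co-partition both RDDs by department and join them, so no driver-side map and no broadcast are needed (again, the names are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Key employees by department and co-partition the averages the same way;
// the join then happens without a broadcast and without an extra shuffle.
val part = new HashPartitioner(rdd1.partitions.length)

val byDept = rdd1.map(e => (e.dept, e)).partitionBy(part)

val avgByDept = byDept
  .mapValues(e => (e.salary, 1L))
  .reduceByKey(part, { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) })
  .mapValues { case (sum, count) => sum / count }

val fractions = byDept.join(avgByDept)
  .map { case (_, (e, avg)) => (e.id, e.salary / avg) }
```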