Could you please tell me in which cases I should use the rdd.cache() and sc.broadcast() methods? (Broadcast is a SparkContext method, not an RDD method.)
Let's take an example -- suppose you have employee_salary data that contains the department and salary of every employee. Now say the task is to find, for each employee, their salary as a fraction of their department's average salary. (If employee e1 is in department d1, we need to compute e1.salary / average(all salaries in d1).)
Now one way to do this is -- you first read the data into an RDD -- say rdd1 -- and then do two things, one after the other.
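For concreteness, here is a minimal sketch of the read step in Scala, assuming a hypothetical CSV-style input of id,dept,salary lines (the Employee case class, the field order, and the file path are all illustrative, not from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record type -- adjust to your actual schema.
case class Employee(id: String, dept: String, salary: Double)

val sc = new SparkContext(new SparkConf().setAppName("salary-fractions"))

// Parse each "id,dept,salary" line into an Employee record.
// (No handling of malformed lines in this sketch.)
val rdd1 = sc.textFile("hdfs:///data/employee_salary.csv").map { line =>
  val Array(id, dept, salary) = line.split(",")
  Employee(id, dept, salary.toDouble)
}
```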
First, calculate the department-wise salary average using rdd1. You will eventually have the department average salaries result -- basically a map of deptId to average salary -- on the driver.
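One way to compute that map, continuing the sketch above -- a classic sum-and-count average, where collectAsMap() is the standard pair-RDD action for pulling a small keyed result back to the driver:

```scala
// Sum and count salaries per department in one pass, then divide.
// collectAsMap() brings the small per-department result to the driver.
val deptAvg: Map[String, Double] = rdd1
  .map(e => (e.dept, (e.salary, 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collectAsMap()
  .toMap
```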
Second, you will need to use this result to divide each employee's salary by their department's average. Remember that any worker can hold employees from any department, so the department-wise averages must be available on every worker. How to do this? You can send the averages map you collected on the driver to each worker as a broadcast variable*, and it can then be used to calculate the salary fraction for every "row" in rdd1.
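Continuing the sketch, the broadcast step might look like this (sc.broadcast() and .value are the actual Spark API; the variable names are mine):

```scala
// Create the broadcast variable once on the driver. Spark ships it to each
// executor a single time; every task on that executor reads it via .value.
val avgBc = sc.broadcast(deptAvg)

// Each employee's salary as a fraction of their department's average.
val fractions = rdd1.map(e => (e.id, e.salary / avgBc.value(e.dept)))
```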
What about caching an RDD? Remember that from the initial rdd1 there are two branches of computation -- one for calculating the department-wise averages and another for applying those averages to each employee in the RDD. Now, if you do not cache rdd1, then for the second task Spark will go back to the source and recompute it (re-reading and re-parsing the input), because Spark does not keep intermediate RDDs around by default. But since we know we will use the same RDD twice, we can ask Spark to keep it in memory the first time it is computed. Then the next time we apply transformations to it, it is already in memory.
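In the sketch above, that is a single extra call. Note that cache() is lazy, so what matters is placing it before the first action:

```scala
rdd1.cache() // lazy: nothing is stored until the first action runs

// Pass 1 (the reduceByKey ... collectAsMap job above) is the first action,
// so it computes rdd1 once and stores its partitions in executor memory.
// Pass 2 (the broadcast-based map over rdd1) then reuses those cached
// partitions instead of re-reading and re-parsing the input file.
```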
*We could use department-based partitioning and a join to avoid the broadcast, but for the purpose of illustration, let's say we do not.
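For completeness, a sketch of that alternative: co-partition both RDDs by department and join them, so no driver-side map and no broadcast are needed (again, the names are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Key employees by department and co-partition the averages the same way;
// the join then happens without a broadcast and without an extra shuffle.
val part = new HashPartitioner(rdd1.partitions.length)

val byDept = rdd1.map(e => (e.dept, e)).partitionBy(part)

val avgByDept = byDept
  .mapValues(e => (e.salary, 1L))
  .reduceByKey(part, { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) })
  .mapValues { case (sum, count) => sum / count }

val fractions = byDept.join(avgByDept)
  .map { case (_, (e, avg)) => (e.id, e.salary / avg) }
```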