
I've always heard that Spark is 100x faster than classic MapReduce frameworks like Hadoop. But recently I've been reading that this is only true if the RDDs are cached, which I had thought happened automatically but in fact requires an explicit call to cache().

I would like to understand how all of the RDDs produced along the way are stored. Suppose we have this workflow (a rough code sketch follows the list):

  1. I read a file -> I get RDD_ONE
  2. I apply a map to RDD_ONE -> I get RDD_TWO
  3. I apply some other transformation to RDD_TWO -> I get RDD_THREE
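
For reference, here is roughly what I mean in code (the path and the specific transformations are just placeholder examples):

  val rddOne   = sc.textFile("hdfs:///data/input.txt")   // 1. read a file -> RDD_ONE
  val rddTwo   = rddOne.map(line => line.toUpperCase)    // 2. map on RDD_ONE -> RDD_TWO
  val rddThree = rddTwo.filter(line => line.nonEmpty)    // 3. another transformation -> RDD_THREE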

QUESTIONS:

If I don't use cache() or persist(), is every RDD stored in memory, in a cache, or on disk (local file system or HDFS)?

If RDD_THREE depends on RDD_TWO, which in turn depends on RDD_ONE (lineage), and I didn't call cache() on RDD_THREE, will Spark recompute RDD_ONE (rereading it from disk) and then RDD_TWO in order to get RDD_THREE?

Thanks in advance.

3 Answers


In Spark there are two types of operations: transformations and actions. A transformation on a dataframe returns another dataframe, while an action on a dataframe returns a value.

Transformations are lazy: when a transformation is applied, Spark only adds it to the DAG, and the DAG is executed when an action is called.

Suppose you read a file into a dataframe, then perform a filter, a join, and an aggregation, and then call count. The count operation, which is an action, actually kicks off all the previous transformations.

If we call another action (like show), the whole chain of operations is executed again, which can be time consuming. So, if we don't want to run the whole set of operations again and again, we can cache the dataframe.
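
A minimal sketch of that flow (leaving out the join for brevity; the file path and column names are made up just for illustration, and spark is the SparkSession):

  import org.apache.spark.sql.functions.col

  // Nothing runs yet: these are lazy transformations that only build up the plan (DAG)
  val orders   = spark.read.option("header", "true").csv("/data/orders.csv")
  val filtered = orders.filter(col("amount").cast("double") > 100)
  val byUser   = filtered.groupBy("user_id").count()

  byUser.count()   // action #1: executes read -> filter -> groupBy
  byUser.show()    // action #2: without cache(), the whole plan is executed again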

A few pointers to consider while caching:

  1. Cache only when the resulting dataframe is generated by significant transformations. If Spark can regenerate the cached dataframe in a few seconds, then caching is not required.
  2. Cache should be used when the dataframe feeds multiple actions (see the sketch after this list). If there are only 1-2 actions on the dataframe, then it is not worth keeping that dataframe in memory.
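
Continuing the hypothetical byUser dataframe from the sketch above, caching it once before several actions avoids recomputing the lineage each time:

  byUser.cache()     // mark the dataframe for caching; it is materialized on the first action

  byUser.count()                          // first action: computes the plan and populates the cache
  byUser.show()                           // later actions read from the cached partitions
  byUser.write.parquet("/tmp/by_user")    // placeholder output path

  byUser.unpersist() // free the cached blocks once they are no longer needed
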
Partha Deb
  • Ok thanks, very clear. But when I start the first action, all the RDDs are created. I've always heard that these RDDs are in memory, but if I start a second action, why aren't the RDDs already in memory reused instead of being recreated from scratch? Why am I forced to cache, and why doesn't Spark do it automatically? Where are cached RDDs stored? In HDFS (or locally)? – Carlo Casadei Jun 11 '21 at 10:59
  • Memory is limited; we don't want to use all of it for storing RDDs and then be left with nothing for processing. So it's up to the user to decide what to cache according to their use case. – Partha Deb Jun 11 '21 at 16:13

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
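
For illustration, a minimal sketch using the standard StorageLevel constants (the RDD names are just placeholders; an RDD's storage level can only be assigned once, so treat these as alternatives):

  import org.apache.spark.storage.StorageLevel

  rddOne.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
  rddTwo.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that don't fit in memory to local disk
  rddThree.persist(StorageLevel.MEMORY_ONLY_2)  // keep in memory, replicated on two nodes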

  • Thanks. I'll ask you the same question I asked everyone. When I start the first action, all the RDDs are created. I've always heard that these RDDs are in memory, but if I start a second action, why aren't the RDDs already in memory reused instead of being recreated from scratch? Why am I forced to cache, and why doesn't Spark do it automatically? Where are cached RDDs stored? In HDFS (or locally)? – Carlo Casadei Jun 11 '21 at 11:00
  • Hope this will help you. https://stackoverflow.com/a/28983767/7808058 – Buvaneswari Viswanathan Jun 16 '21 at 11:05

To answer your questions:

Q1: If I don't use cache() or persist(), is every RDD stored in memory, in a cache, or on disk (local file system or HDFS)?

Ans: Consider data that is available on the worker nodes as blocks in HDFS. When you create an RDD for the file as

val rdd = sc.textFile("<HDFS Path>")

the underlying blocks of data from each node (HDFS) are read into the executors' memory (RAM) as partitions when an action runs (in Spark, the blocks of HDFS data are called partitions once loaded into memory). Without cache() or persist(), those partitions are not kept around after the action finishes.
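
For example (a small sketch; the path is the same placeholder as above), you can see that the number of partitions roughly matches the number of HDFS blocks of the input file:

  val rdd = sc.textFile("<HDFS Path>")
  println(rdd.getNumPartitions)   // roughly one partition per HDFS block of the input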

Q2: If RDD_THREE depends on RDD_TWO, which in turn depends on RDD_ONE (lineage), and I didn't use cache() on RDD_THREE, will Spark recalculate RDD_ONE (rereading it from disk) and then RDD_TWO to get RDD_THREE?

Ans: Yes. Since cache() was not used in this scenario, the intermediate results are not kept in the executors' memory, so Spark recomputes the whole lineage every time an action needs RDD_THREE.
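
A sketch of that scenario (the path and transformations are placeholders):

  val rddOne   = sc.textFile("hdfs:///data/input.txt")
  val rddTwo   = rddOne.map(_.split(","))
  val rddThree = rddTwo.filter(_.nonEmpty)

  rddThree.count()   // runs the whole lineage: read file -> map -> filter
  rddThree.count()   // nothing was cached, so the whole lineage runs again, including rereading the file

If rddTwo.cache() (or persist()) were called before the first action, the second count() could start from the cached RDD_TWO partitions on the executors instead of going back to the file.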