
1. What is the default level of persistence for cache() in Apache Spark in Python?

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER

2. As per the "Learning Spark" book, the default level of persistence for persist() is MEMORY_ONLY_SER. Is that correct?

2 Answers


What Apache Spark version are you using? Supposing you're using the latest one (2.3.1):

According to the Python documentation for Spark RDD Persistence, the storage level when you call either cache() or persist() is MEMORY_ONLY.

Only memory is used to store the RDD by default.
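As a quick check, here is a minimal PySpark sketch (assuming a Spark 2.x installation; the app name and RDD contents are just for illustration) that caches an RDD and prints the storage level it actually received:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="default-storage-level")

rdd = sc.parallelize(range(1000))

rdd.cache()                       # same as calling rdd.persist() with no arguments
print(rdd.getStorageLevel())      # should report the MEMORY_ONLY level

rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # choosing a different level explicitly
print(rdd.getStorageLevel())

sc.stop()
```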

Also, if you specify the Apache Spark version you are using, or the version referred to by the "Learning Spark" book, we could help you better.

Álvaro Valencia

It is MEMORY_ONLY now. Check out the source code; it's in Scala, but simple:

def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
def persist(newLevel: StorageLevel): this.type = {
  // doing stuff...
}

The storage level you should use depends on the RDD itself. For example, if you don't have enough RAM and use the MEMORY_ONLY level, partitions that don't fit are dropped and have to be recalculated from the beginning. With MEMORY_AND_DISK, you still have a copy on disk and can read it back from there.

Most of the time, recalculating this data is faster than reading it from disk (and persisting to disk also means writing it out first, which is slower still). That's why MEMORY_ONLY is the default value.
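As an illustration of when you might override the default, here is a hedged PySpark sketch (the input path and pipeline are hypothetical): if recomputing the lineage is expensive, spilling to disk with MEMORY_AND_DISK can be cheaper than recalculation:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-tradeoff")

# Hypothetical expensive pipeline; the path is only for illustration.
events = (sc.textFile("hdfs:///data/events.csv")
            .map(lambda line: line.split(","))
            .filter(lambda cols: len(cols) > 3))

# With MEMORY_ONLY (the default), partitions that don't fit in RAM are dropped
# and recomputed from the lineage when needed again.
# With MEMORY_AND_DISK, such partitions are spilled to disk and read back
# instead of being recomputed.
events.persist(StorageLevel.MEMORY_AND_DISK)

print(events.count())   # first action materializes and persists the RDD
print(events.count())   # second action reuses the persisted partitions

sc.stop()
```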

The differences between the levels can be found in the official guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

Cyspy