I have some code that runs calculations on a DataFrame:
+------------------------------------+------------+----------+----+------+
| Name| Role|Experience|Born|Salary|
+------------------------------------+------------+----------+----+------+
| 瓮䇮滴ୗ┦附䬌┊ᇕ鈃디蠾综䛿ꩁ翨찘... | охранник| 16|1960|108111|
| 擲鱫뫉ܞ琱폤縭ᘵ훧귚۔♧䋐滜컑... | повар| 14|1977| 40934|
| 㑶뇨ꄳ壚ᗜ㙣샾ꎓ㌸翧쉟梒靻駌푤... | геодезист| 29|1997| 27335|
| ࣆ᠘䬆䨎⑁烸ᯠણ ᭯몇믊ຮ쭧닕㟣紕... | не охранн.| 4|1999| 30000|
... ... ...
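For reference, df has the schema shown above (Name, Role, Experience, Born, Salary). A minimal stand-in that reproduces it looks roughly like this; the SparkSession setup and the sample rows are only illustrative, the real data set has many more rows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CacheCheckpoint")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative rows only; the real data has many more records.
val df = Seq(
  ("瓮䇮滴...", "охранник", 16, 1960, 108111),
  ("擲鱫뫉...", "повар", 14, 1977, 40934),
  ("㑶뇨ꄳ...", "геодезист", 29, 1997, 27335)
).toDF("Name", "Role", "Experience", "Born", "Salary")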
I tried caching the table in different ways:
import org.apache.spark.sql.functions.col

def processDataFrame(mode: String): Long = {
  val t0 = System.currentTimeMillis()

  val topDf = df.filter(col("Salary") > 50000)

  // Cache / persist / checkpoint the filtered frame depending on the mode.
  val cacheDf = mode match {
    case "CACHE"                => topDf.cache()
    case "PERSIST"              => topDf.persist()
    case "CHECKPOINT"           => topDf.checkpoint()
    case "CHECKPOINT_NON_EAGER" => topDf.checkpoint(false)
    case _                      => topDf
  }

  // Two independent actions that should both reuse cacheDf.
  val roleList = cacheDf.groupBy("Role")
    .count()
    .orderBy("Role")
    .collect()

  val bornList = cacheDf.groupBy("Born")
    .count()
    .orderBy(col("Born").desc)
    .collect()

  val t1 = System.currentTimeMillis()
  t1 - t0 // elapsed time in milliseconds
}
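For the record, checkpoint() fails unless a checkpoint directory has been set, so the comparison is driven roughly like this (the directory path and the cache reset between runs are just an illustration, not necessarily the exact project code):

// checkpoint() only works after a checkpoint directory has been set;
// the path here and the cache clearing between runs are my own choices.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

Seq("NONE", "CACHE", "PERSIST", "CHECKPOINT", "CHECKPOINT_NON_EAGER").foreach { mode =>
  val elapsedMs = processDataFrame(mode)
  println(s"$mode: $elapsedMs ms")
  spark.catalog.clearCache() // drop any cached blocks before the next mode runs
}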
The timing results made me wonder: why is checkpoint(false) more efficient than persist()? After all, a checkpoint has to spend extra time serializing objects and writing them to disk.
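To see what each mode does to the query, the physical plans can be compared with something like this (a rough inspection sketch, not code from the project):

// Compare physical plans: a persisted frame is read back through an
// in-memory relation, while checkpoint() cuts the lineage so the plan
// reads the previously checkpointed data instead of re-running the filter.
val persisted = df.filter(col("Salary") > 50000).persist()
persisted.count()                              // materialize the cache
persisted.groupBy("Role").count().explain()    // plan contains InMemoryTableScan

val checkpointed = df.filter(col("Salary") > 50000).checkpoint()
checkpointed.groupBy("Role").count().explain() // filter is gone; data comes from the checkpoint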
P.S. My small project on GitHub: https://github.com/MinorityMeaning/CacheCheckpoint