What is the difference between spark checkpoint and local checkpoint?

Question

What is the difference between a Spark checkpoint and a local checkpoint? When I make a local checkpoint, I see this in the Spark UI:

It shows that the local checkpoint is stored in memory.

In the image, this storage is tied to the operations. In the code I use: df.localCheckpoint() – Shadowtrooper, Nov 14 '19 at 13:50

LizardKing · Accepted Answer · 2019-11-14T14:47:11.027

8

A local checkpoint stores your data in executor storage (as shown in your screenshot). It is useful for truncating the lineage graph of an RDD. However, if a node fails, the checkpointed data is lost and you have to recompute it, which can be expensive depending on your application.

A 'standard' checkpoint stores your data in a reliable file system (such as HDFS). It is more expensive to perform, but you will not need to recompute the data even if a node fails. It also truncates the lineage graph.

Truncating a long lineage graph avoids stack overflow exceptions and is particularly useful in iterative algorithms.
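To make the stack overflow point concrete, here is a minimal toy model in plain Python. It is not Spark code: the `Node` class, its `map`/`compute`/`checkpoint` methods, and the `iterate` helper are all hypothetical names invented for this sketch. It only assumes that evaluating a dataset walks back through its parents recursively, and that a checkpoint materializes the values and drops the link to the parent.

```python
import sys


class Node:
    """Toy stand-in for a dataset that remembers how it was derived."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent  # the node this one was derived from
        self.fn = fn          # function applied to each parent element
        self.data = data      # set only for sources and checkpointed nodes

    def map(self, fn):
        # Lazy: nothing is computed, the lineage just gets one step longer.
        return Node(parent=self, fn=fn)

    def compute(self):
        # Recurses through the whole lineage until it finds stored data.
        if self.data is not None:
            return self.data
        return [self.fn(x) for x in self.parent.compute()]

    def checkpoint(self):
        # Materialize the values and forget the parent (lineage truncated).
        return Node(data=self.compute())

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


def iterate(steps, checkpoint_every=None):
    """Apply `x + 1` `steps` times, optionally checkpointing along the way."""
    node = Node(data=[0, 1, 2])
    for i in range(1, steps + 1):
        node = node.map(lambda x: x + 1)
        if checkpoint_every and i % checkpoint_every == 0:
            node = node.checkpoint()
    return node


# Make the lineage longer than the interpreter's recursion limit.
steps = sys.getrecursionlimit() * 2

long_chain = iterate(steps)
try:
    long_chain.compute()
    overflowed = False
except RecursionError:
    overflowed = True

short_chain = iterate(steps, checkpoint_every=100)
result = short_chain.compute()

print("lineage without checkpoints:", long_chain.lineage_depth(), "overflowed:", overflowed)
print("lineage with checkpoints:", short_chain.lineage_depth())
```

In this sketch the un-checkpointed chain fails with a `RecursionError`, while the chain checkpointed every 100 steps computes normally because its lineage never grows past 100 nodes. Where the materialized values live (executor storage or a reliable file system) is exactly what separates the local checkpoint from the standard one; the toy model does not capture that part.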

edited Nov 14 '19 at 14:47

answered Nov 14 '19 at 14:39


5

For me, local checkpointing writes directly to the executors' *disks*, not memory. The [doc](https://spark.apache.org/docs/2.4.1/api/scala/index.html#org.apache.spark.sql.Dataset) says: "Local checkpoints are written to executor storage" – bonnal-enzo Nov 14 '19 at 14:41

bonnal-enzo · Answer 2 · 2022-11-06T02:24:21.350

4

Local checkpointing writes data to executor storage.
Regular checkpointing writes data to HDFS.

Local checkpointing is faster than regular checkpointing, but regular checkpointing is safer because it relies on HDFS reliability (e.g. data block replication).

edited Nov 06 '22 at 02:24

answered Nov 14 '19 at 14:38

