Dataframe Checkpoint Example Pyspark

Question

I read about checkpoint and it looks great for my needs but I couldn't find a good example of how to use it.
My questions are:

Should I specifiy the checkpoint dir? Is it possible to do it like this:

df.checkpoint()
Are there any optional params that I should be aware about?
Is there a default checkpoint dir or I must specify one as default?
When I checkpoint dataframe and I reuse it - It autmoatically read the data from the dir that we wrote the files?

It will be great if you can share with me example of using checkpoint in pyspark with some explanation. Thanks!

score 8 · Accepted Answer · answered Jun 10 '21 at 09:12

You should assign the checkpointed dataframe to a variable as checkpoint "Returns a checkpointed version of this Dataset" (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.checkpoint.html). So

df = df.checkpoint()

The only parameter is eager which dictates whether you want the checkpoint to trigger an action and be saved immediately, it is True by default and you usually want to keep it this way.

You have to set the checkpoint directory with SparkContext.setCheckpointDir(dirName) somewhere in your script before using checkpoints. Alternatively if you want to save to memory instead you can use localCheckpoint() instead of checkpoint() but that is not reliable and in case of issues/after termination the checkpoints will be lost (but it should be faster as it uses the caching subsystem instead of only writing to disk).

And yes, it should be read automatically, you can look at the history server and there should be "load data" nodes (I don't remember the exact name) at the start of blocks/queries

I tried the above checkpointing in my spark application by the checkpoint directory has a new folder created without any contents? Is this expected? — Appden65, Mar 02 '23 at 01:52

Dataframe Checkpoint Example Pyspark

1 Answers1