I tested on spark 2.4.4 (EMR 5.28) and its working.
I might be a permission or config issue via EMR as previously I tried on spark 2.4.3 and I don't see any issue on checkpointing in the 2.4.4 release notes.
df = spark.range(1, 7, 2)
df.show()
rdd = df.rdd
rdd = rdd.cache()
print("Storage Level - {}".format(rdd.getStorageLevel()))
print("Is Checkpointed - {}".format(rdd.isCheckpointed()))
print("Checkpoint File - {}".format(rdd.getCheckpointFile()))
# Setting HDFS directory
sc.setCheckpointDir("/tmp/checkpoint_dir/")
rdd.checkpoint()
print("Is Checkpointed - {}".format(rdd.isCheckpointed()))
print("Checkpoint File - {}".format(rdd.getCheckpointFile()))
# Calling an action
print("count - {}".format(rdd.count()))
print("Is Checkpointed - {}".format(rdd.isCheckpointed()))
print("Checkpoint File - {}".format(rdd.getCheckpointFile()))
Output:
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+
Storage Level - Memory Serialized 1x Replicated
Is Checkpointed - False
Checkpoint File - None
Is Checkpointed - False
Checkpoint File - None
count - 3
Is Checkpointed - True
Checkpoint File - hdfs://ip-xx-xx-xx-xx.ec2.internal:8020/tmp/checkpoint_dir/5d3bf642-cc17-4ffa-be10-51c58b8f5fcf/rdd-9