Questions tagged [checkpointing]
105 questions
48
votes
3 answers
What is the difference between .pt, .pth and .pwf extentions in PyTorch?
I have seen in some code examples, that people use .pwf as model file saving format. But in PyTorch documentation .pt and .pth are recommended. I used .pwf and worked fine for small 1->16->16 convolutional network.
My question is what is the…

Asil
- 684
- 1
- 6
- 15
21
votes
3 answers
spark streaming checkpoint recovery is very very slow
Goal: Read from Kinesis and store data in to S3 in Parquet format via spark streaming.
Situation:
Application runs fine initially, running batches of 1hour and the processing time is less than 30 minutes on average. For some reason lets say the…

Gaurav Shah
- 5,223
- 7
- 43
- 71
14
votes
3 answers
How to checkpoint a long-running function pythonically?
The typical situation in computational sciences is to have a program that runs for several days/weeks/months straight. As hardware/OS failures are inevitable, one typically utilize checkpointing, i.e. saves the state of the program from time to…

Alexander Pozdneev
- 1,289
- 1
- 13
- 31
13
votes
1 answer
Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0
Description
We have a Spark Streaming 1.5.2 application in Scala that reads JSON events from a Kinesis Stream, does some transformations/aggregations and writes the results to different S3 prefixes. The current batch interval is 60 seconds. We have…

MiguelPeralvo
- 837
- 1
- 11
- 19
13
votes
1 answer
Spark streaming checkpoints for DStreams
In Spark Streaming it is possible (and mandatory if you're going to use stateful operations) to set the StreamingContext to perform checkpoints into a reliable data storage (S3, HDFS, ...) of (AND):
Metadata
DStream lineage
As described here, to…

Pablo Francisco Pérez Hidalgo
- 27,044
- 8
- 36
- 62
10
votes
4 answers
Keras callbacks keep skip saving checkpoints, claiming val_acc is missing
I'll run some larger models and want to try intermediate results.
Therefore, I try to use checkpoints to save the best model after each epoch.
This is my code:
model = Sequential()
model.add(LSTM(700, input_shape=(X_modified.shape[1],…

xentity
- 129
- 1
- 2
- 10
7
votes
3 answers
Why does Spark throw "SparkException: DStream has not been initialized" when restoring from checkpoint?
I am restoring a stream from a HDFS checkpoint (ConstantInputDSTream for example) but I keep getting SparkException: has not been initialized.
Is there something specific I need to do when restoring from checkpointing?
I can see that it wants…

Shane Kinsella
- 267
- 2
- 13
6
votes
2 answers
How to load a checkpoint file in a pytorch model?
In my pytorch model, I'm initializing my model and optimizer like this.
model = MyModelClass(config, shape, x_tr_mean, x_tr,std)
optimizer = optim.SGD(model.parameters(), lr=config.learning_rate)
And here is the path to my checkpoint file.…

Josh Susa
- 385
- 1
- 6
- 13
5
votes
2 answers
Checkpointing Event Time Watermarks in Flink
We are receiving events from a no. of independent data sources and hence, data arriving into our Flink topology (via Kafka) would be out of order.
We are creating 1-min event time windows in our Flink topology and generating event time watermarks as…

Vijay Kansal
- 809
- 12
- 26
5
votes
2 answers
spark-scala checkpointing cleanup
I am running a spark application in 'local' mode. It's checkpointing correctly to the directory defined in the checkpointFolder config. However, there are two issues that I am seeing that are causing some disk space issues.
1) As we have multiple…

Kashyapgv
- 57
- 6
5
votes
0 answers
Stack overflow when starting spark streaming from a checkpoint
When restarting spark streaming from a checkpoint I got this exception.
Because it's not related to any code I had produce I have no idea what can cause this problem.
Any idea?
Exception in thread "streaming-start" java.lang.StackOverflowError
at…

crak
- 1,635
- 2
- 17
- 33
4
votes
1 answer
Spark Checkpointing Non-Streaming - Checkpoint files can be used in subsequent job run or driver program
This text from an interesting article: http://www.lifeisafile.com/Apache-Spark-Caching-Vs-Checkpointing/
" ... Checkpointing stores the rdd physically to hdfs and destroys the lineage that created it. The checkpoint file won’t be deleted even after…

thebluephantom
- 16,458
- 8
- 40
- 83
4
votes
0 answers
How to store eventhub checkpoint data locally in Java
I'm trying to find an example of Custom Checkpoint Manager in JAVA, that can store checkpoint data in a local folder.
Basically, I'm building a java application that reads data from azure event hub with multiple consumer groups. Previously I was…

Mani Mukhtar
- 53
- 2
4
votes
0 answers
Failing HDFS checkpointing in Spark Streaming
after deploying my Spark Streaming Job on a Standalone Spark cluster, i got some problems with checkpointing. The console log yields a hint:
WARN ReliableCheckpointRDD: Error writing partitioner org.apache.spark.HashPartitioner@2 to…

its_a_paddo
- 102
- 7
4
votes
1 answer
Is spark checkpointing faster than caching?
In my spark application, I am reading few hive tables in spark rdd and then performing few transformation on those rdds later. To avoid re computation I cached those rdds using rdd.cache() or rdd.persist() and rdd.checkpoint() methods.
As per spark…

Nachiket Kate
- 8,473
- 2
- 27
- 45