Questions tagged [checkpointing]

105 questions
48
votes
3 answers

What is the difference between .pt, .pth and .pwf extensions in PyTorch?

I have seen in some code examples that people use .pwf as the file format for saving models, but the PyTorch documentation recommends .pt and .pth. I used .pwf and it worked fine for a small 1->16->16 convolutional network. My question is what is the…
Asil
21
votes
3 answers

spark streaming checkpoint recovery is very very slow

Goal: Read from Kinesis and store data into S3 in Parquet format via Spark Streaming. Situation: The application runs fine initially, running batches of 1 hour with an average processing time of less than 30 minutes. For some reason, let's say the…
14
votes
3 answers

How to checkpoint a long-running function pythonically?

The typical situation in computational sciences is to have a program that runs for several days/weeks/months straight. As hardware/OS failures are inevitable, one typically uses checkpointing, i.e. saves the state of the program from time to…
Alexander Pozdneev
13
votes
1 answer

Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0

Description: We have a Spark Streaming 1.5.2 application in Scala that reads JSON events from a Kinesis Stream, does some transformations/aggregations and writes the results to different S3 prefixes. The current batch interval is 60 seconds. We have…
13
votes
1 answer

Spark streaming checkpoints for DStreams

In Spark Streaming it is possible (and mandatory if you're going to use stateful operations) to set the StreamingContext to checkpoint both of the following into reliable data storage (S3, HDFS, ...): metadata and DStream lineage. As described here, to…
10
votes
4 answers

Keras callbacks keep skipping checkpoint saves, claiming val_acc is missing

I'm going to run some larger models and want to try out intermediate results, so I am trying to use checkpoints to save the best model after each epoch. This is my code: model = Sequential() model.add(LSTM(700, input_shape=(X_modified.shape[1],…
xentity
7
votes
3 answers

Why does Spark throw "SparkException: DStream has not been initialized" when restoring from checkpoint?

I am restoring a stream from an HDFS checkpoint (ConstantInputDStream, for example) but I keep getting SparkException: has not been initialized. Is there something specific I need to do when restoring from a checkpoint? I can see that it wants…
Shane Kinsella
6
votes
2 answers

How to load a checkpoint file in a PyTorch model?

In my PyTorch model, I'm initializing my model and optimizer like this: model = MyModelClass(config, shape, x_tr_mean, x_tr,std) optimizer = optim.SGD(model.parameters(), lr=config.learning_rate) And here is the path to my checkpoint file.…
Josh Susa
5
votes
2 answers

Checkpointing Event Time Watermarks in Flink

We are receiving events from a number of independent data sources, and hence data arriving in our Flink topology (via Kafka) will be out of order. We are creating 1-min event time windows in our Flink topology and generating event time watermarks as…
Vijay Kansal
5
votes
2 answers

spark-scala checkpointing cleanup

I am running a Spark application in 'local' mode. It's checkpointing correctly to the directory defined in the checkpointFolder config. However, I am seeing two issues that are causing disk space problems. 1) As we have multiple…
Kashyapgv
5
votes
0 answers

Stack overflow when starting spark streaming from a checkpoint

When restarting Spark Streaming from a checkpoint I got this exception. Because it's not related to any code I have produced, I have no idea what could cause this problem. Any ideas? Exception in thread "streaming-start" java.lang.StackOverflowError at…
crak
4
votes
1 answer

Spark Checkpointing Non-Streaming - Checkpoint files can be used in subsequent job run or driver program

This text is from an interesting article: http://www.lifeisafile.com/Apache-Spark-Caching-Vs-Checkpointing/ " ... Checkpointing stores the rdd physically to hdfs and destroys the lineage that created it. The checkpoint file won’t be deleted even after…
thebluephantom
4
votes
0 answers

How to store eventhub checkpoint data locally in Java

I'm trying to find an example of a custom checkpoint manager in Java that can store checkpoint data in a local folder. Basically, I'm building a Java application that reads data from Azure Event Hub with multiple consumer groups. Previously I was…
4
votes
0 answers

Failing HDFS checkpointing in Spark Streaming

After deploying my Spark Streaming job on a standalone Spark cluster, I ran into some problems with checkpointing. The console log yields a hint: WARN ReliableCheckpointRDD: Error writing partitioner org.apache.spark.HashPartitioner@2 to…
its_a_paddo
4
votes
1 answer

Is Spark checkpointing faster than caching?

In my Spark application, I am reading a few Hive tables into Spark RDDs and then performing a few transformations on those RDDs. To avoid recomputation I cached those RDDs using the rdd.cache() or rdd.persist() and rdd.checkpoint() methods. As per spark…
Nachiket Kate