How to upgrade or restart Spark streaming application without state loss?

Question

I'm using updateStateByKey() in my application, and I want to keep its state across application restarts.

I could save the state to external storage on every batch, but that may take a lot of time.

So I wonder whether there is a way to store the state only when the application is stopped.

Or is there another solution to upgrade application code without losing the current state?

score 0 · Answer 1 · answered May 23 '17 at 10:44

As of Spark 2.1.0, nothing makes this work out of the box; you have to store the data yourself if you want to upgrade. The built-in checkpoint contains serialized application objects, so it generally cannot be restored after the code changes. One possibility is to avoid updateStateByKey and mapWithState altogether and keep the state somewhere external, such as a key-value store.
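
To make the external-store idea concrete, here is a minimal sketch in plain Python. It is not Spark code: SqliteStateStore and process_batch are hypothetical names, SQLite stands in for whatever key-value store you would actually use, and the state is assumed to be a running sum per key. In a real job the body of process_batch would run on each batch (for example from foreachRDD), and the store would be a networked service rather than a local file.

```python
import os
import sqlite3
import tempfile
from collections import Counter


class SqliteStateStore:
    """Hypothetical stand-in for an external key-value store."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state "
            "(key TEXT PRIMARY KEY, total INTEGER NOT NULL)"
        )
        self.conn.commit()

    def get(self, key):
        # Unknown keys start at 0, like an empty state in updateStateByKey.
        row = self.conn.execute(
            "SELECT total FROM state WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else 0

    def add_all(self, deltas):
        # One transaction per batch, so a batch is applied fully or not at all.
        with self.conn:
            for key, delta in deltas.items():
                self.conn.execute(
                    "INSERT OR REPLACE INTO state (key, total) VALUES (?, ?)",
                    (key, self.get(key) + delta),
                )

    def close(self):
        self.conn.close()


def process_batch(store, records):
    """Pre-aggregate one batch of (key, value) pairs, then merge into the store."""
    deltas = Counter()
    for key, value in records:
        deltas[key] += value
    store.add_all(deltas)


# First run of the application: process one batch, then stop.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
store = SqliteStateStore(db_path)
process_batch(store, [("a", 1), ("b", 2), ("a", 3)])
store.close()

# "Restart" (possibly with upgraded code): the state is still in the store.
restarted = SqliteStateStore(db_path)
process_batch(restarted, [("a", 5)])
print(restarted.get("a"), restarted.get("b"))
```

Because the state lives outside the streaming job, a restart or a code upgrade only has to reconnect to the store. Pre-aggregating each batch before writing keeps the number of store round trips proportional to the number of distinct keys in the batch rather than the number of records.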

Spark 2.2 is going to bring a new HDFS-backed state store, but I haven't had a chance to look at it to see whether it overcomes the weaknesses of the current checkpointing implementation.

score 0 · Answer 2 · answered May 23 '17 at 16:33

There are many options for saving state during each batch. I've listed the majority of them in this answer. Since you highlight the latency this adds (going over the network, serialization, etc.), I wanted to point out SnappyData. SnappyData deeply integrates an in-memory database with Spark so that they share the same JVM and block manager. This eliminates the serialization step during each batch, which should reduce the latency of writing out your state. It can also persist the state when your application stops, as you requested.

(disclaimer: I am an employee of SnappyData)
