5

We got one or two CheckPoint Failure during processing data every day. The data volume is low, like under 10k, and our interval setting is '2 minutes'. (The reason for processing very slow is we need to sink the data to another API endpoint which take some time to process at the end of flink job, so the time is Streaming data + Sink to external API endpoint).

The root issue is: Checkpoints time out after 10 mins, this caused by the data processing time longer than 10 mins, so the checkpoint time out. We might increase the parallelism to fast the processing, but if the data become bigger, we have to increase the parallelism again, so don't want to use this way.

Suggested solution: I saw someone suggest to set the pause between old and new checkpoint, but I have some question here is, if I set the pause time there, will the new checkpoint missing the state in the pause time?

Aim: How to avoid this issue and record the correct state that doesn't miss any data?

Failed checkpoint:

enter image description here

Completed checkpoint:

enter image description here

subtask didn't respond

enter image description here

Thanks

Dan Bonachea
  • 2,408
  • 5
  • 16
  • 31
Brian Z
  • 99
  • 1
  • 9
  • How exactly does it fail? Timeout? – TobiSH Apr 26 '19 at 05:39
  • checkpoints time out after 10 mins, so I saw someone suggest to set the pause between old and new checkpoint, but I have some question here is, if I set the pause time there, will the new checkpoint missing the state in the pause time? @TobiSH – Brian Z Apr 28 '19 at 16:13

1 Answers1

0

There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.

Setting an interval between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.

Sounds like you should extend the timeout, which you can do like this:

env.getCheckpointConfig().setCheckpointTimeout(n);

where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.

David Anderson
  • 39,434
  • 4
  • 33
  • 60
  • I found something interesting is the failed checkpoint just record the less state compare with the completed checkpoint, and acknowledge time also less than the completed one. Any idea? I attached the screenshots. – Brian Z Apr 29 '19 at 15:20
  • It seems odd that your checkpoints are so small, yet take so long to complete. Do you have something like heavy back pressure that may be preventing the checkpoint barriers from making progress? – David Anderson Apr 29 '19 at 15:45
  • We do have the backpressure, but I'm not sure whether it prevents the checkpoint barriers to make progress. But, I observed that some subtask didn't respond that I believe it causes the checkpoint failed. I have attached the screenshot. Any idea? – Brian Z Apr 29 '19 at 20:08
  • And all other subtasks were finished quickly, only the first one in the screenshot didn' respond until time out. Couldn't figure out what's the reason there. – Brian Z Apr 29 '19 at 20:10