Let's write a simple word count job:
DataStream<Tuple2<String, Integer>> counts =
    text.flatMap(new Tokenizer())
        .keyBy(value -> value.f0)
        .sum(1);
(The source and other details are irrelevant.) Suppose the following string arrives in the pipeline:
"the cat is on the table"
The result is:
<the - 1>
<cat - 1>
<is - 1>
<on - 1>
<the - 2>
<table - 1>
The only word found twice is "the".
It seems that the sum() function is stateful: for each word it keeps at least the last <word - count> tuple and updates it when a new <word, 1> tuple arrives (obviously partitioned by word).
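To make that assumption concrete, here is a minimal plain-Java sketch of the behaviour I have in mind. It does not use Flink at all: the HashMap is only a hypothetical stand-in for whatever per-key state sum() might keep, and RunningWordCount, process and run are names invented for this illustration, not Flink APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RunningWordCount {
    // Hypothetical stand-in for per-key state: one running count per word.
    private final Map<String, Integer> countsByWord = new HashMap<>();

    // Handles one <word, 1> record and returns the updated <word - count> tuple as text.
    public String process(String word) {
        int updated = countsByWord.merge(word, 1, Integer::sum);
        return "<" + word + " - " + updated + ">";
    }

    // Splits a line into words (assumed tokenizer behaviour) and emits one update per word.
    public static List<String> run(String line) {
        RunningWordCount counter = new RunningWordCount();
        List<String> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(counter.process(word));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        run("the cat is on the table").forEach(System.out::println);
    }
}
```

Each incoming word produces one output record carrying the running total for that word, which matches the result listed above. Whether Flink really implements sum() this way is exactly what I am unsure about.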
If this is true and checkpointing is enabled, is this "state" saved in the checkpoint and restored after a failure?