0

I am reading the article the-world-beyond-batch-streaming-102 by Tyler Akidau. For the watermark I am still a bit confused, i.e. about the code in the article:

PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()))

  .apply(Sum.integersPerKey());

It simply tells the engine trigger at the watermark, but how does the engine know the watermark ? As I understand it should be some kind of time delay user needs to say. Or is the engine built so smart that it tries to make one (according some default strategy or configuration) for the users ?

Thanks very much.

David Anderson
  • 39,434
  • 4
  • 33
  • 60
Vulcann
  • 177
  • 12

1 Answers1

3

Google Dataflow (which is what Tyler Akidau is describing in the article you cite) can use a heuristic to estimate watermarks -- see this answer for more details.

Flink, on the other hand, depends on explicit watermarks which are either emitted by the data source or by a watermark generator. The most common approach is to assume a bounded delay.

David Anderson
  • 39,434
  • 4
  • 33
  • 60