0

I'm trying to create a simple application which writes to Cassandra the page views of each web page on my site. I want to write every 5 minutes the accumulative page views from the start of a logical hour.

My code for this looks something like this:

KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
            .groupByKey()
            .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg")

Where I also set my commit interval to 5 minutes by setting the COMMIT_INTERVAL_MS_CONFIG property. To my understanding that should aggregate on full hour and output intermediate accumulation state every 5 minutes.

My questions now are two:

  1. Given that I have my own Cassandra driver, how do I write the 5 min intermediate results of the aggregation to Cassandra? Tried to use foreach but that doesn't seem to work.

  2. I need a write only after 5 min of aggregation, not on each update. Is it possible? Reading here suggests it might not without using low-level API, which I'm trying to avoid as it seems like a simple enough task to be accomplished with the higher level APIs.

idoda
  • 6,248
  • 10
  • 39
  • 52

1 Answers1

1

Committing and producing/writing output is two different concepts in Kafka Streams API. In Kafka Streams API, output is produced continuously and commits are used to "mark progress" (ie, to commit consumer offsets including the flushing of all stores and buffered producer records).

You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

1) To write to Casandra, it is recommended to write the result of you application back into a topic (via #to("topic-name")) and use Kafka Connect to get the data into Casandra.

Compare: External system queries during Kafka Stream processing

2) Using low-level API is the only way to go (as you pointed out already) if you want to have strict 5-minutes intervals. Note, that next release (Kafka 1.0) will include wall-clock-time punctuations which should make it easier for you to achieve your goal.

Matthias J. Sax
  • 59,682
  • 7
  • 117
  • 137
  • Hi, thanks for commenting. I would have used Kafka connect but I need some monitoring (counts, timers) and for my understanding, I'll need to write my own connector to achieve that. – idoda Sep 10 '17 at 10:47