Kafka streams reduce after groupby to stream sends partial reduce output on commit

Question

We're having an issue where, after a groupBy --> reduce --> toStream chain, partially reduced values are sent downstream when a commit happens during the reduce. So if there are 65 keys to be reduced and a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.

So here is our case in more detail:

msg --> leftJoin
leftJoin --> flatMap //break msg into parts so we can join again downstream
flatMap --> leftJoin
leftJoin --> groupByKey
groupByKey --> reduce
reduce --> toStream
toStream --> to

Currently, we've come up with a very ugly fix for this: we add an index and an "out of" (total count) value to each message created during the flatMap phase, and we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or are looking at it the wrong way. Please advise on the correct way of doing this.

Thanks.

miguno · Answer 1 · 2021-09-20T09:36:10.963

"So if there are 65 keys to be reduced and a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced."

If I understand your description correctly, this is actually intended behavior. It reflects a tradeoff between processing latency (where you want to see update records as soon as you have a new piece of input data) and coalescing multiple update records into fewer or even just a single update record.

The default behavior of Kafka Streams is to favor lower processing latency. That is, it will not wait for "all input data to have arrived" before sending downstream updates. Rather, it will send updates once new data has arrived. Some background information is described at https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/.
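To make the tradeoff concrete, here is a toy model in plain Java. It is not Kafka Streams code and uses no Kafka classes; the class name `ReduceEmissionSketch`, the `run` method and the `commitEvery` parameter are made up for illustration. It sums values per key and only forwards the latest value per key when a simulated commit flushes the buffer, which is roughly the effect a record cache has.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Kafka Streams code and uses no Kafka classes.
final class ReduceEmissionSketch {

    // Sums values per key and records every update that would be forwarded downstream.
    // commitEvery = number of input records between simulated commits (buffer flushes).
    static List<String> run(List<Map.Entry<String, Integer>> input, int commitEvery) {
        Map<String, Integer> state = new LinkedHashMap<>();  // running aggregate per key
        Map<String, Integer> dirty = new LinkedHashMap<>();  // updates held back since the last flush
        List<String> emitted = new ArrayList<>();
        int seen = 0;
        for (Map.Entry<String, Integer> record : input) {
            int updated = state.merge(record.getKey(), record.getValue(), Integer::sum);
            dirty.put(record.getKey(), updated);             // a newer value overwrites the older one
            seen++;
            if (seen % commitEvery == 0) {
                flush(dirty, emitted);                       // simulated commit
            }
        }
        flush(dirty, emitted);                               // final flush at end of input
        return emitted;
    }

    private static void flush(Map<String, Integer> dirty, List<String> emitted) {
        dirty.forEach((key, value) -> emitted.add(key + "=" + value));
        dirty.clear();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            input.add(Map.entry("k", 1));
        }
        System.out.println(run(input, 1));  // flush after every record
        System.out.println(run(input, 2));  // flush halfway through
        System.out.println(run(input, 4));  // single flush at the end
    }
}
```

With four records for the same key, flushing after every record forwards four updates, flushing halfway through forwards a partial sum followed by the full sum (the situation from the question), and flushing once forwards only the final sum. The catch is that the model only "knows" the input has ended because the list is finite; a real stream has no such end.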

Today, you have two main knobs to tune this default behavior: (1) the Kafka Streams record caches (for the DSL) and (2) the configured commit interval (which you already mentioned).
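As a rough sketch, these two knobs correspond to the `commit.interval.ms` and `cache.max.bytes.buffering` settings. The snippet below uses the plain string keys so that it compiles without any Kafka dependency; the class name `StreamsTuningSketch`, the helper `coalescingConfig` and the values are illustrative assumptions, not recommendations. Newer Kafka releases may use a different name for the cache setting, so check the documentation for your version.

```java
import java.util.Properties;

// Illustrative helper; the class and method names are hypothetical.
final class StreamsTuningSketch {

    // Builds the two settings as plain string keys, so no Kafka dependency is needed to compile this.
    static Properties coalescingConfig(long commitIntervalMs, long cacheBytes) {
        Properties props = new Properties();
        // Same key as StreamsConfig.COMMIT_INTERVAL_MS_CONFIG
        props.put("commit.interval.ms", Long.toString(commitIntervalMs));
        // Same key as StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG
        props.put("cache.max.bytes.buffering", Long.toString(cacheBytes));
        return props;
    }

    public static void main(String[] args) {
        // Example values only: a 30 second commit interval and a 10 MB record cache.
        Properties props = coalescingConfig(30_000L, 10L * 1024 * 1024);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

You would merge these entries into the Properties you already pass to your Kafka Streams application. A larger cache and a longer commit interval generally mean fewer intermediate updates, but they only reduce the number of updates; they do not guarantee a single final record per key.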

Moving forward, the Kafka community has also been working on a new feature that will allow you to specify that you want just a single, final update record to be sent (rather than what you described as "partial" updates). This new feature, in case you are interested, is described in the Kafka Improvement Proposal KIP-328: Ability to suppress updates for KTables. This is actively being worked on, but it is unlikely to be finished in time for the upcoming Kafka v2.1 release in October.

"Currently, we've come up with a very ugly fix for this: we add an index and an "out of" (total count) value to each message created during the flatMap phase, and we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or are looking at it the wrong way. Please advise on the correct way of doing this."

In short, in stream processing you should embrace the nature of how streaming works. In general, you will only have partial/incomplete knowledge of the world, so to speak, or rather: you only know what you observed thus far. So, at any given point in time, you must deal with the situation that more, additional data may arrive that you still have to deal with.

A typical situation is having to deal with late-arriving data, where your application logic must decide whether to still integrate and process this data (quite likely) or to discard it (sometimes that is how it needs to be).

Going back to your example:

So if there are 65 keys to be reduced [...]

How would one know it's 65, and not 100 or 28, and so on? One can only tell that: "Thus far, at this point in time, I have received 65. So, what do I do? Do I reduce those 65 because I believe that's all the input? Or do I wait some seconds/minutes/hours longer because there might be 35 more to arrive, but this will mean that I will not send an update/answer downstream until this waiting time has elapsed (which results in higher processing latency)?"

In your situation, I would ask: Why do you consider the streaming behavior of how/when updates are being sent a problem? Perhaps it's because you have a downstream system or application that doesn't know how to handle such streaming updates?

Does that make any sense? Again, the above is based on my understanding of what you described as being the issue.
