
When we apply a group-by function on a stream based on some key, how does Kafka calculate this, since the same key may be present in different partitions? I was looking at the through() function, which basically repartitions the data, but I don't understand what that means. Will it move all the messages with the same key into a single partition? Also, how frequently can we call the through() method? Can we call it after receiving each message if there is a requirement? Please suggest. Thanks

Sanjay

1 Answer


Data in Kafka is (by default) always partitioned by key. If you call groupBy(), the grouping attribute is set as the message key, and thus, when the data is written into the repartition topic, all records with the same key are written into the same partition. When the data is read back, the aggregation can therefore be computed correctly in the aggregate() function.
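To see why this guarantee holds, here is a minimal, self-contained sketch of the idea. This is not Kafka's actual partitioner (the default partitioner uses murmur2 hashing; a plain hashCode() stands in here to keep the example dependency-free), but the principle is the same: because the target partition is a deterministic function of the key, every record with the same key necessarily lands in the same partition of the repartition topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionSketch {

    // Deterministic key -> partition mapping (simplified stand-in for
    // Kafka's murmur2-based default partitioner).
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        Map<Integer, List<String>> partitions = new HashMap<>();

        // Records with duplicate keys, as they would be produced into a
        // repartition topic.
        for (String key : List.of("user-a", "user-b", "user-a", "user-c", "user-b")) {
            partitions.computeIfAbsent(partitionFor(key, numPartitions),
                    p -> new ArrayList<>()).add(key);
        }

        // Every occurrence of e.g. "user-a" ends up in the same partition,
        // so a downstream aggregate() sees all of them on one task.
        partitions.forEach((p, keys) ->
                System.out.println("partition " + p + ": " + keys));
    }
}
```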

Note that Kafka Streams performs this repartitioning automatically (including the creation of the required topic). Calling repartition() (or through()) would achieve the same, but it's not necessary.

Also note that a Kafka Streams program is a dataflow program. When using the DSL, you only specify the dataflow program itself; nothing is processed yet. Only when you call KafkaStreams#start() is the dataflow program executed.
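For illustration, here is a minimal DSL program showing both points. Topic names, the application id, and the serdes are assumptions for the sketch, not taken from the question; running it requires a Kafka cluster. groupBy() causes an automatic repartition step before count(), and nothing is processed until start() is called.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class GroupByExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // groupBy() sets a new key; Kafka Streams automatically creates and
        // writes through a repartition topic so that all records with the
        // same (new) key end up in the same partition before count() runs.
        input.groupBy((key, value) -> value,
                      Grouped.with(Serdes.String(), Serdes.String()))
             .count()
             .toStream()
             .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "group-by-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Up to this point only the dataflow program has been described;
        // processing begins only when start() is called.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```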

Matthias J. Sax
  • Thanks for the information. I still want to know: after repartitioning happens on a key (automatically or via through()), do all the messages for a key move to a single partition? Also, is it safe to use this repartitioning frequently? – Sanjay May 18 '20 at 15:03
  • As stated in the answer: `all records with the same key are written into the same partition` -- not sure what you mean by "safe"? – Matthias J. Sax May 18 '20 at 18:14
  • As repartitioning is a costly operation, is it fine if this is performed during processing of each message? – Sanjay May 18 '20 at 20:11
  • Well, there is some overhead. But the topic is not repartitioned multiple times. Each message is written into the repartition topic once, and read back once. Not multiple times. – Matthias J. Sax May 19 '20 at 01:58
  • I have the same question; what confuses me is that writing the record to the proper partition happens on the producer and broker side, but `groupBy` is called on the consumer side. – peon Jun 10 '21 at 12:39
  • The producer picks the partitions (brokers don't make any decisions about it). Also note that Kafka Streams uses consumers and producers internally: cf. https://docs.confluent.io/platform/current/streams/architecture.html – Matthias J. Sax Jun 10 '21 at 18:32