What is the fundamental difference between an event with a batch of data attached and a Kafka stream that occasionally sends data? Can they be used interchangeably? When should I use the former and when the latter? Could you provide some simple use cases?

Note: There is some info in the comments of this question, but I would ask for a more well-rounded answer.


1 Answer


I assume that by the "difference" between streams and events with batched data you mean the following:

  • Stream: Every event of interest is sent to the stream immediately. Those individual events are therefore fine-grained and small(er) in size.
  • Batch event: Multiple individual events are aggregated into a larger batch, and once the batch reaches a certain size, a certain time has passed, or a business transaction has completed, the batch event is sent to the stream. Those batch events are therefore coarser-grained and large(r) in size.
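
To make the distinction concrete, here is a minimal sketch using the plain Kafka Java producer; the topic names ("orders", "orders-batched") and the hand-built JSON payloads are purely illustrative, and a real implementation would use a proper serializer:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventStyles {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Stream style: one small record per individual event, sent as it happens
            producer.send(new ProducerRecord<>("orders", "order-42",
                    "{\"event\":\"ITEM_ADDED\",\"orderId\":42,\"sku\":\"A-1\"}"));

            // Batch style: several individual events aggregated into one coarser record,
            // published once the batch is full or the business transaction has completed
            producer.send(new ProducerRecord<>("orders-batched", "order-42",
                    "{\"orderId\":42,\"events\":["
                  + "{\"event\":\"ITEM_ADDED\",\"sku\":\"A-1\"},"
                  + "{\"event\":\"ITEM_ADDED\",\"sku\":\"B-7\"},"
                  + "{\"event\":\"CHECKED_OUT\"}]}"));
        }
    }
}
```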

Here is a list of characteristics that I can think of:

  • Realtime/latency: End-to-end processing time will typically be shorter for individual events and longer for batch events, because the publisher may hold back a batch event until enough individual events have accumulated.

  • Throughput: Message brokers differ in the maximum number of events per second they can ingest and deliver for a comparable amount of data. For example, Kinesis can handle fewer events per second than a finely tuned Kafka cluster, so if you were to use Kinesis, batch events may make more sense to achieve the desired throughput in terms of individual events. Note: From what I know, the Kinesis client library has a feature to transparently batch individual events, if desired/possible, to increase throughput; the Kafka producer batches records client-side in a similar way (see the producer sketch after this list).

  • Order and correlation: If multiple individual events belong to one business transaction and need to be processed by consumers together and/or in order, batch events may make this easier because all related data becomes available to consumers at once. With individual events, you have to put appropriate measures in place, such as choosing a suitable partition key, to guarantee that they are processed in order and possibly by the same consumer worker instance (also shown in the producer sketch after this list).

  • Failure case: If a batch event contains independent individual events, it may happen that only a subset of them fails to process (whether temporarily or permanently). In such a case, consumers may not be able to simply retry the entire event, because parts of the batch event have already caused state changes. Explicit logic (= additional effort) may be necessary to handle partial processing failure of batch events (see the consumer sketch below).
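
As a rough illustration of the throughput and ordering points above, here is a Java producer sketch that turns on the client's transparent record batching (batch.size, linger.ms) and keys every record of a business transaction so related events land in the same partition; the topic name, the key, and the configuration values are examples only, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedBatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Throughput: let the client batch individual records transparently.
        // Records are buffered up to batch.size bytes per partition, or until
        // linger.ms has passed, before a single request is sent to the broker.
        props.put("batch.size", "65536");   // 64 KiB per partition
        props.put("linger.ms", "20");       // wait up to 20 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String transactionId = "tx-1001"; // illustrative business-transaction id
            // Order/correlation: using the transaction id as the record key sends
            // all related individual events to the same partition, so one consumer
            // instance sees them in the order they were produced.
            producer.send(new ProducerRecord<>("payments", transactionId, "{\"step\":\"AUTHORIZED\"}"));
            producer.send(new ProducerRecord<>("payments", transactionId, "{\"step\":\"CAPTURED\"}"));
            producer.send(new ProducerRecord<>("payments", transactionId, "{\"step\":\"SETTLED\"}"));
        }
    }
}
```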

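For the failure case, the consumer side usually needs explicit partial-failure handling. The sketch below re-routes only the failed sub-events of a batch event to a retry topic; the topic names and the splitIntoSubEvents/processEvent helpers are hypothetical placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchEventConsumer {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "localhost:9092");
        cprops.put("group.id", "batch-handler");
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, String> retryProducer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(List.of("orders-batched"));
            while (true) {
                for (ConsumerRecord<String, String> batch : consumer.poll(Duration.ofSeconds(1))) {
                    // Each record is a batch event containing several independent sub-events.
                    for (String event : splitIntoSubEvents(batch.value())) {
                        try {
                            processEvent(event); // may change state; cannot be blindly redone
                        } catch (Exception e) {
                            // Retrying the whole batch would re-apply sub-events that already
                            // succeeded, so only the failed sub-event is re-routed.
                            retryProducer.send(new ProducerRecord<>("orders-retry", batch.key(), event));
                        }
                    }
                }
            }
        }
    }

    // Hypothetical helpers: real code would parse the JSON payload and apply business logic.
    static List<String> splitIntoSubEvents(String batchPayload) { return List.of(batchPayload); }
    static void processEvent(String event) { /* apply the state change */ }
}
```
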
To answer your question whether the two can be used interchangeably: in theory, yes, but depending on the specific use case, one of the two approaches will likely result in better performance or in a less complex design/code/configuration.

I'll edit my answer if I can think of more differentiating characteristics.

Christoph