12

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing? I am trying to grasp the technical and programmatic differences as well.

Please help me understand by reporting from your experience.

pascalwhoop
Stella
  • 1
    Beam requires a cluster scheduler to run the code. Kafka Streams can be embedded within any Java application. That's one of the main differences. Beam can communicate with more streams than only Kafka – OneCricketeer Jun 15 '18 at 11:25
  • Cluster scheduler meaning "Runners" right? Beam stream cannot be embedded within any java app? How do we find Beam can communicate with more streams than Kafka? – Stella Jun 15 '18 at 15:47
  • I don't know Beam terminology. AFAIK, you cannot run Beam in a standalone Java application. It would need to run within a scheduler like YARN or Mesos. And Beam can read from Google DataFlow, for example; Kafka Streams cannot. – OneCricketeer Jun 15 '18 at 19:23

2 Answers

17

Beam is an API that gives you one unified way to program against an underlying stream processing engine such as Flink or Spark.

Kafka is mainly an integration platform: it offers a topic-based messaging system that standalone applications use to communicate with each other.

On top of this messaging system (and the Producer/Consumer API), Kafka offers an API to perform stream processing, using messages as data and topics as input or output. Kafka Streams applications are standalone Java applications and act as regular Kafka Consumers and Producers (this is important for understanding how these applications are managed and how the workload is shared among stream processing application instances).

In short, Kafka Streams applications are standalone Java applications that run outside the Kafka cluster, read their input from the Kafka cluster, and write their results back to the Kafka cluster. With other stream processing platforms, stream processing applications run inside the cluster engine (and are managed by this engine), read their input from somewhere else, and write their results somewhere else.

One big difference between the Kafka Streams API and the Beam API is that Beam makes the difference between bounded and unbounded data inside the data stream whereas Kafka does not make that difference. As a result, handling bounded data with Kafka API has to be done manually using timed/sessionized windows to gather data.
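
To make the idea of "timed windows" more concrete, here is a minimal, library-free sketch of tumbling (fixed-size, non-overlapping) window counting in plain Java. It is not the Kafka Streams API, and the class and method names (`TumblingWindowSketch`, `countPerWindow`) are hypothetical. It only illustrates how timestamped records from an endless stream can be gathered into finite groups.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Library-free illustration of tumbling windows.
// Assumption: this is NOT Kafka Streams code; all names here are hypothetical.
class TumblingWindowSketch {

    // Counts events per window. Each window is identified by its start time
    // and covers the half-open interval [start, start + windowSizeMs).
    static Map<Long, Integer> countPerWindow(List<Long> timestampsMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            // floorDiv keeps the window assignment correct for negative timestamps too.
            long windowStart = Math.floorDiv(ts, windowSizeMs) * windowSizeMs;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three events, grouped into 5-second windows.
        List<Long> timestamps = Arrays.asList(1_000L, 4_000L, 6_000L);
        System.out.println(countPerWindow(timestamps, 5_000L)); // prints {0=2, 5000=1}
    }
}
```

A real stream processor also has to decide when a window is complete, since late records can still arrive. This sketch ignores that by working on a list that is already finite.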

  • 4
    "whereas Kafka does not make that difference" - I feel like this isn't discussing KTables in the Kafka Streams API – OneCricketeer Nov 01 '18 at 04:04
  • 1
    Sorry, could you please elaborate on the "handling bounded data with Kafka API has to be done manually using timed/sessionized windows to gather data". I'm using Beam currently, and although technically, as you say, "Beam makes the difference between bounded and unbounded data", it only impact what types of input and output IO you can use, but the processing code is literally exactly the same in both cases. However, how with Kafka one would be able to have a bounded source at all? Aren't Kafka Stream inputs - well, Kafka streams? – Tim Jun 10 '19 at 09:24
12

Beam is a programming API, not a system or library that processes data on its own. Multiple Beam runners are available that implement the Beam API.
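
The split between "API" and "runner" can be illustrated with a toy model in plain Java. None of the types below are Beam classes; `Pipeline`, `Runner`, `LoopRunner`, and `ParallelStreamRunner` are hypothetical names made up for this sketch. The point is only that the pipeline describes the steps, while each runner decides how to execute them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of "API vs. runner".
// Assumption: none of these types exist in Apache Beam; they are illustrative only.
class ApiVersusRunnerSketch {

    // The "API": a pipeline only records transformation steps; it never executes them.
    static class Pipeline {
        final List<Function<String, String>> steps = new ArrayList<>();

        Pipeline apply(Function<String, String> step) {
            steps.add(step);
            return this;
        }
    }

    // A "runner" decides how a recorded pipeline is actually executed.
    interface Runner {
        List<String> run(Pipeline pipeline, List<String> input);
    }

    // Applies every recorded step, in order, to a single element.
    static String applyAll(Pipeline pipeline, String element) {
        String value = element;
        for (Function<String, String> step : pipeline.steps) {
            value = step.apply(value);
        }
        return value;
    }

    // Runner 1: a plain sequential loop.
    static class LoopRunner implements Runner {
        public List<String> run(Pipeline pipeline, List<String> input) {
            List<String> out = new ArrayList<>();
            for (String element : input) {
                out.add(applyAll(pipeline, element));
            }
            return out;
        }
    }

    // Runner 2: the same pipeline, executed on a parallel stream.
    static class ParallelStreamRunner implements Runner {
        public List<String> run(Pipeline pipeline, List<String> input) {
            return input.parallelStream()
                    .map(element -> applyAll(pipeline, element))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline().apply(String::trim).apply(String::toUpperCase);
        List<String> input = Arrays.asList(" beam ", "kafka");
        System.out.println(new LoopRunner().run(pipeline, input));           // prints [BEAM, KAFKA]
        System.out.println(new ParallelStreamRunner().run(pipeline, input)); // prints [BEAM, KAFKA]
    }
}
```

The pipeline definition is identical for both runners; only the execution strategy changes.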

Kafka is a stream processing platform and ships with Kafka Streams (aka the Streams API), a Java stream processing library that is built to read data from Kafka topics and write results back to Kafka topics.

Matthias J. Sax
  • Thanks. I saw that Beam doesn't have specific Streams API like Kafka Streams API. I am wondering how does it stream data then? – Stella Jun 15 '18 at 19:23
  • 1
    Also note, that Beam offers a unified API for batch and stream processing. But as I said, it's an API only -- the actual implementation of the API are the so-called *runners* -- Beam itself does not process any data; it's not a system or library. – Matthias J. Sax Jun 15 '18 at 20:39
  • Why can't there be an Apache Beam runner for Kafka Stream processing? – user1870400 Sep 20 '18 at 15:09
  • There could be, if someone implements one. AFAIK, some existing runners support Kafka as a source or sink via Beam. However, I am not aware of a runner that builds on Kafka's stream processing library, Kafka Streams. – Matthias J. Sax Sep 20 '18 at 18:00