In Samza and Kafka Streams, data stream processing is performed in a sequence/graph (called "dataflow graph" in Samza and "topology" in Kafka Streams) of processing steps (called "job" in Samza" and "processor" in Kafka Streams). I will refer to these two terms as workflow and worker in the remainder of this question.
Let's assume that we have a very simple workflow, consisting of a worker A that consumes sensor measurements and filters all values below 50 followed by a worker B that receives the remaining measurments and filters all values above 80.
Input (Kakfa topic X) --> (Worker A) --> (Worker B) --> Output (Kafka topic Y)
If I have understood
- http://samza.apache.org/learn/documentation/0.11/introduction/concepts.html and
- http://docs.confluent.io/3.1.1/streams/architecture.html#parallelism-model
correctly, both Samza and Kafka Streams use the topic partitioning concept for replicating the workflow/workers and thus parallelizing the processing for scalability purposes.
But:
Samza replicates each worker (i.e., job) separately to multiple tasks (one for each partition in the input stream). That is, a task is a replica of a worker of the workflow.
Kafka Streams replicates the whole workflow (i.e., topology) at once to multiple tasks (one for each partition in the input stream). That is, a task is a replica of the whole workflow.
This brings me to my questions:
Assume that there is only one partition: Is it correct, that it is not possible to deploy worker (A) and (B) on two different machines in Kafka Streams while this is possible in Samza? (Or in other words: Is it impossible in Kafka Streams to split a single task (i.e., topology replica) to two machines no matter if there are multiple partitions or not.)
How do two subsequent processors in a Kafka Streams topology (in the same task) communicate? (I know that in Samza all communication between two subsequent workers (i.e., jobs) is done with Kafka topics but since one has to "mark" in Kafka Streams explicitly in the code which streams have to be published as Kafka topics this can't be the case here.)
Is it correct that Samza publishes also all intermediate streams automatically as Kafka topics (and thus makes them available to potential clients) while Kafka Streams only publishes those intermediate and final streams that one marks explicitly (with
addSink
in the low-level API andto
orthrough
in DSL)?
(I'm aware of the fact, that Samza can use also other message queues than Kafka but this is not really relevant for my questions.)