
I have a stream of events that can be categorized by type and hourly timestamp. My initial thought was to put events into different Kafka topics, one per category. However, that could easily end up with hundreds of topics. Plus, if they are not cleaned up properly (topics are created dynamically by the program[1] in my case), the system is likely to be left with thousands of them. From what I have read[2], that seems to cause significant overhead in ZooKeeper.

My second thought was to stream all events to a single topic and create multiple consumers. The downside is wasted bandwidth, because every consumer has to read through all events to find the ones it is interested in.
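To make the cost of this second option concrete, here is a minimal sketch in plain Python with no Kafka client involved. The event tuples, the category names, and the `consume` helper are all hypothetical; the point is only that a consumer on a shared topic reads every event and keeps a fraction of them.

```python
def consume(events, wanted_category):
    """Simulate one consumer scanning every event on a shared topic."""
    read = 0
    kept = []
    for category, payload in events:
        read += 1  # the event is fetched whether or not it is wanted
        if category == wanted_category:
            kept.append(payload)
    return read, kept


# Hypothetical events as (category, payload) pairs.
events = [
    ("login", 1),
    ("click", 2),
    ("click", 3),
    ("purchase", 4),
    ("login", 5),
    ("click", 6),
]

read, kept = consume(events, "purchase")
print(f"read {read} events, kept {len(kept)}")
```

Here the consumer interested in "purchase" events fetches all six events to keep one of them.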

Another approach is to combine the first and second methods and find a tradeoff, i.e. create one topic with multiple partitions, where several categories of events go into the same partition.
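A minimal sketch of what that mapping could look like, assuming categories are identified by name and the partition is chosen by hashing the name. The category names, the partition count, and the `partition_for` helper are hypothetical, and CRC32 is only a stand-in for a stable hash, not Kafka's own partitioner.

```python
import zlib


def partition_for(category: str, num_partitions: int) -> int:
    """Map a category name to a partition index with a stable hash."""
    # zlib.crc32 gives the same value in every process, unlike the
    # built-in hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(category.encode("utf-8")) % num_partitions


# Hypothetical categories and partition count.
categories = ["login", "purchase", "click", "logout", "search"]
NUM_PARTITIONS = 3

assignment = {c: partition_for(c, NUM_PARTITIONS) for c in categories}
for category, partition in sorted(assignment.items()):
    print(f"{category} -> partition {partition}")
```

With more categories than partitions, some categories necessarily share a partition, so a consumer assigned to that partition still has to filter, but over a smaller slice of the stream.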

I'd like to know what the sane approach is in this scenario.

--

cfchou

1 Answer


I think the best strategy is to create a topic for each semantically different stream of data, and to partition it when you need more parallelism. This way you can easily set each consumer to read from the appropriate topic, and adding new partitions is trivial, since the consumers will automatically start consuming from the new ones.

As you suggested, it is also possible to partition data based on the category of the events and set a consumer group to read from all of the partitions, but this can create problems when you want to add more partitions (or more consumers), because you will probably need to modify the mapping between consumers and partitions. Increasing parallelism also becomes more complex.
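The remapping problem can be illustrated with a small sketch, assuming categories are assigned to partitions by a stable hash of the category name modulo the partition count. The category names and both helpers are hypothetical, and CRC32 stands in for whatever hash the real partitioner uses.

```python
import zlib


def partition_for(category: str, num_partitions: int) -> int:
    """Stable hash of the category name, reduced to a partition index."""
    return zlib.crc32(category.encode("utf-8")) % num_partitions


def moved_categories(categories, old_count, new_count):
    """Return the categories whose partition changes when the count changes."""
    return [
        c
        for c in categories
        if partition_for(c, old_count) != partition_for(c, new_count)
    ]


# Hypothetical category names.
categories = [f"category-{i}" for i in range(100)]

moved = moved_categories(categories, old_count=4, new_count=5)
print(f"{len(moved)} of {len(categories)} categories changed partition")
```

Every category in `moved` ends up on a different partition after the resize, so any consumer that was tied to specific partitions for specific categories has to be reconfigured.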

I would suggest not worrying about ZooKeeper performance at first and starting with the most natural approach. Kafka can usually handle a large number of topics without too much overhead.

fede1024
  • Adding to @fede1024's answer, I found this post worth reading: http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ – cfchou Apr 23 '15 at 12:30