
What is the best way to create topics in Kafka?

  • How many replicas/partitions should be defined when we create a topic?

In the new producer API, when I try to publish a message to a non-existing topic, it fails the first time and then publishes successfully.

  • I would like to know the relationship between replicas, partitions, and the number of cluster nodes.
  • Do we need to create a topic prior to publishing messages?
– Ratha

5 Answers


When you start your Kafka broker you can define a set of properties in the conf/server.properties file. This file is a plain key-value properties file. One of these properties is auto.create.topics.enable; if it is set to true (the default), Kafka will create topics automatically when you send messages to non-existing topics.
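As a minimal sketch, the relevant fragment of that file looks like this (all other broker settings omitted):

```properties
# conf/server.properties
# Default is true: producing to a topic that does not exist creates it
# with the broker defaults (num.partitions, default.replication.factor).
auto.create.topics.enable=true
```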

You can find all config options defined here. IMHO, a simple rule for creating topics is the following: the number of replicas cannot be more than the number of nodes that you have. The number of topics and partitions is unaffected by the number of nodes in your cluster.

For example:

  • You have a 9-node cluster
  • Your topic can have 9 partitions and 9 replicas, or 18 partitions and 9 replicas, or 36 partitions and 9 replicas, and so on (one way to create such a topic up front is sketched below)
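For instance, a topic matching the first combination above could be created ahead of time with the kafka-topics.sh tool. The host, port, and topic name here are placeholders, and newer Kafka versions take --bootstrap-server instead of --zookeeper:

```bash
# Create a topic with 9 partitions, each replicated to all 9 brokers
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --topic my-topic \
  --partitions 9 \
  --replication-factor 9
```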
– ponkin
  • Wouldn't demanding the number of replicas to be equal to the number of nodes make your cluster extremely fragile? One node goes down and suddenly your cluster is no longer responsive because it has to wait for the right number of replicas. – Seth Paulson Jan 18 '17 at 18:07
  • @SethPaulson There is no waiting because a node goes down. In that scenario the leader will remove it from the list of "in-sync" replicas, and tries to recover it should it come back. See [Kafka Documentation on Replication](https://kafka.apache.org/documentation/#replication) for a detailed description. – Jens Hoffmann Sep 21 '17 at 16:46
  • @SethPaulson Setting **min.insync.replicas=2** when the replication factor is 3 and acks=all does not wait for all 3 replicas to be up. In this case, if 2 replicas are available, the cluster will continue to work uninterrupted. – Suhas Chikkanna Jan 09 '18 at 15:10

Partition number determines the parallelism of the topic, since one partition can only be consumed by one consumer in a consumer group. For example, if you only have 10 partitions for a topic and 20 consumers in a consumer group, 10 consumers will be idle and not receive any messages. The right number really depends on your application, but anywhere from 1 to the low thousands is reasonable.

Replica number is determined by your durability requirement. For a topic with replication factor N, Kafka can tolerate up to N-1 server failures without losing any messages committed to the log. A replication factor of 3 is a common configuration. Of course, the replica number has to be smaller than or equal to your broker count.

The auto.create.topics.enable property controls whether Kafka auto-creates topics on the server. If it is set to true, when applications attempt to produce, consume, or fetch metadata for a non-existent topic, Kafka will automatically create the topic with the default replication factor and number of partitions. I would recommend turning it off in production and creating topics in advance.
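A minimal server.properties sketch along those lines; the non-default values here are illustrative, not recommendations for any particular workload:

```properties
# server.properties
# Recommended for production: disable auto-creation and create topics explicitly.
auto.create.topics.enable=false

# If you do leave auto-creation on, these broker defaults (both 1 by default)
# determine the partition count and replication factor of auto-created topics.
num.partitions=3
default.replication.factor=3
```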

– Lan
  • I'm not sure if you'll get the topic created on the consume or fetch. According to this thread - https://www.mail-archive.com/users@kafka.apache.org/msg09182.html - "a topic can be auto created by the producer, but not the consumer". The latest docs at http://kafka.apache.org/documentation.html#brokerconfigs just say "Enable auto creation of topic on the server", without saying which actions would cause the creation. – Brian Mar 07 '17 at 01:42
  • @Lan - Could you please shed some more light on your recommendation of turning the auto-creation off in Production environment? i.e. the reasons for the same? It would be great if you could share any references that cover this in detail. Thanks in advance. – Lalit May 31 '18 at 15:02
  • @Lalit - while I can't speak for Lan, my interpretation of his comment is to turn off things you don't need in production. This is called hardening. It's a best practice. If your application requires the dynamic creation of topics, then perhaps you would want to use the feature. – Nick Jun 29 '18 at 02:54
  • @Nick - Thanks for your response. I was basically just looking for some standard patterns or the factors that could drive this decision of auto-enabling and whether they are technical or operational. I was thinking of posting this as a question hoping to invite opinions from the community. – Lalit Jun 29 '18 at 09:53
  • @Lalit - sounds good. I can tell you from my long night last night that the auto creation of topics causes some slowdowns if you attempt to consume them immediately. In my case I created a dynamic data pipeline. I suspect it's due to not using partitions that match the brokers. The default topic creation only uses 1 partition (maybe I'm doing it wrong?). Lowering meta_data_max_age_ms to something small like 5000 instead of the 300000 default helped in my case. – Nick Jun 29 '18 at 14:12

I'd like to share my recent experience, described in my blog post The Side Effect of Fetching Kafka Topic Metadata, and also give my answers to certain questions brought up here.

1) What is the best way to create topics in Kafka? Do we need to create a topic prior to publishing messages?

I think if we know in advance that we are going to use a fixed-name Kafka topic, we are better off creating the topic before we write to or read from it. This can typically be done in a post-startup script using bin/kafka-topics.sh (see the official documentation for an example), or programmatically with the KafkaAdminClient, which was introduced in Kafka 0.11.0.0.
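A rough sketch of the programmatic route; the broker address, topic name, partition count, and replication factor below are made-up example values:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic name, partitions, and replication factor are example values
            NewTopic topic = new NewTopic("my-topic", 9, (short) 3);
            // Block until the broker confirms the topic has been created
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```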

On the other hand, I do see certain cases where we need to generate a topic name on the fly. In these cases we cannot know the topic name in advance, so we can instead rely on the "auto.create.topics.enable" property. When it is enabled, the topic will be created automatically. And this brings up the second question:

2) Which actions cause topic creation when auto.create.topics.enable is true?

Actually, as @Lan already pointed out:

If this is set to true, when applications attempt to produce, consume, or fetch metadata for a non-existent topic, Kafka will automatically create the topic with the default replication factor and number of partitions.

I would like to put it even simpler:

If auto topic creation is enabled for Kafka brokers, whenever a Kafka broker sees a specific topic name, that topic will be created if it does not already exist.

The fact that fetching metadata can automatically create the topic is often overlooked, myself included. A specific example is the consumer.partitionsFor(topic) API: this method will create the given topic if it does not exist.
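A minimal sketch of that side effect, assuming a broker at localhost:9092 with auto.create.topics.enable=true and a hypothetical topic name "events":

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MetadataSideEffect {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metadata-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // This metadata lookup alone can create "events" on the broker
            // when auto topic creation is enabled and the topic does not exist.
            List<PartitionInfo> partitions = consumer.partitionsFor("events");
            System.out.println("Partitions for events: " + partitions);
        }
    }
}
```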

For anyone interested in more details on what I mentioned above, you can take a look at my blog post on this same topic, The Side Effect of Fetching Kafka Topic Metadata.

– techpoolx

Set the property auto.create.topics.enable=true in your server.properties file. If you have multiple brokers, do the same in every server*.properties file and restart your Kafka servers. But make sure you also set num.partitions to an appropriate number in server*.properties, otherwise there can be performance issues if you have to increase the partitions later.
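Put differently, a sketch of the relevant entries; the partition count is only an example, and the change has to be repeated on every broker before restarting them:

```properties
# server*.properties on every broker
auto.create.topics.enable=true
# Partition count applied to auto-created topics; choose it up front
num.partitions=6
```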

– jack AKA karthik

The basic level of parallelism in Kafka is the partition. On both the producer and the broker side, writes to different partitions can be done fully in parallel.

Things to keep in mind:

  • More Partitions Requires More Open File Handles
  • More Partitions May Increase Unavailability
  • More Partitions May Increase End-to-end Latency

As a rule of thumb, it’s probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers and r is the replication factor.

For example, if you have 9 brokers/nodes in your cluster, your topic could have:

  • 1800 partitions with 3 replicas, or
  • 900 partitions and 2 replicas

EDIT: See the article How to choose the number of topics/partitions in a Kafka cluster? for further details (this answer has been taken from there).

– Jens Hoffmann
  • Thank you for your answer. Please cite external sources when your answer is based on one as a courtesy to the original author. I added the link for you. – Jens Hoffmann Sep 22 '17 at 11:13
  • About the example you shared, are you assuming there's a single topic within the cluster? So, is it 1800 partitions with 3 replicas per topic or in total? I guess it's the latter? – Alper Kanat Jan 01 '19 at 14:07