
I have a topic with 10 partitions, and I have generated events with 9 different keys: A, B, C, D, E, F, G, H, I.

I've observed messages doing this:

Partition 0 - (Message1, Key E), (Message2, Key I)
Partition 1 - (Message3, Key F)
.
.
Partition 7 - (Message4, Key A), (Message5, Key A)
Partition 8 - Empty
Partition 9 - Empty

There are two messages with different keys in the same partition, and there are empty partitions as well.

Is the default partitioner of Kafka creating collisions?

I am producing from one stream, which is load-balanced across two default REST producers.

This is what I was expecting:

 Partition 0 - (Message1, Key E)
 Partition 1 - (Message3, Key F)
 .
 .
 Partition 7 - (Message4, Key A), (Message5, Key A)
 Partition 8 - (Message2, Key I)
 Partition 9 - Empty
Dipperman

2 Answers


Kafka's DefaultPartitioner uses a murmur2 hash of the message key on the producer (client) side to assign a partition to each message. There is no guarantee that, with 10 partitions and a single-digit number of keys, the keys will be distributed uniformly. The partition calculation for each message is independent of the others, and the probability of a collision is a matter of mathematical interest.

EDIT:

It is very unlikely that the murmur2 hash itself produces a collision. However, the number of partitions in a Kafka topic is fixed - it cannot grow, unlike the bucket count in Java's HashMap implementation. So the partitioning algorithm takes the hash modulo the number of partitions. The exact formula is Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

Now you can see that two different keys can indeed end up with the same partition number if their hashes, modulo the number of partitions, are equal.
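
Here is a minimal, self-contained sketch that applies the formula above to the 9 keys from the question with 10 partitions (the class name is made up, and it assumes kafka-clients is on the classpath and that keys are serialized as UTF-8 bytes, as the StringSerializer would do). Running it shows which partition each key hashes to, and why some partitions can receive two keys while others stay empty:

    // Minimal sketch: reproduce the DefaultPartitioner's key-to-partition calculation
    // for the question's keys. Assumes UTF-8 key bytes (StringSerializer behaviour).
    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class PartitionForKey {
        public static void main(String[] args) {
            int numPartitions = 10;
            String[] keys = {"A", "B", "C", "D", "E", "F", "G", "H", "I"};
            for (String key : keys) {
                byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
                // The same formula the DefaultPartitioner applies to keyed messages
                int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
                System.out.println("Key " + key + " -> partition " + partition);
            }
        }
    }

Whatever the exact output, the point is that nothing in this formula looks at which partitions are already occupied or still empty.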

For a large set of random keys, the keys will be distributed roughly uniformly across all partitions.

If you want ordering, then you must use a partition key, in which case your worries about collisions and empty partitions have little practical consequence (and for a large set of random keys, the distribution will be fine). If you assumed that Kafka would centrally make sure that empty partitions are filled first before a key is routed to an already-filled partition, that is not how it works.

senseiwu
  • What is the purpose of having a hash that creates collisions? Is there any guide on choosing keys so that they don't collide? – Dipperman Jun 01 '19 at 07:04
  • @Dipperman, by definition a hash can create collisions (https://en.wikipedia.org/wiki/Hash_function) – Bartosz Wardziński Jun 01 '19 at 12:29
  • @wardziniak I know that a hash, by definition, can create collisions, but I cannot understand why we need collisions in Kafka. Not having collisions would distribute the messages uniformly. – Dipperman Jun 02 '19 at 16:24
  • 1
    If you want to distribute uniformly, leave out the key and the default partitioner will partition using round-robin – senseiwu Jun 02 '19 at 19:14
  • In that case the distribution will be uniform but unordered. I need the events ordered by key. – Dipperman Jun 03 '19 at 06:56
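
To illustrate the keyless option mentioned in the comments, here is a minimal producer sketch (the topic name and broker address are placeholders) that sends records with a null key, so the DefaultPartitioner spreads them across partitions itself instead of hashing a key; per-key ordering is lost in that case:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Sketch of a producer sending records WITHOUT a key, so the partitioner
    // distributes them itself (round-robin in older clients, sticky batching
    // in newer ones). "my-topic" and "localhost:9092" are assumptions.
    public class KeylessProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    // Null key: no hashing, but also no per-key ordering guarantee
                    producer.send(new ProducerRecord<>("my-topic", null, "message-" + i));
                }
            }
        }
    }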

Yes, the default partitioner is going to create collisions, and it will happen at the latest when you have one more key than you have partitions. See the answer of @senseiwu, which explains nicely what happens. If you have a finite set of keys and want to distribute them over the same number of partitions, you must implement your own partitioner.
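
As a starting point for such a custom partitioner, here is a minimal sketch assuming the finite key set A..I from the question; the class name and the key-to-partition mapping are made up for illustration:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Hypothetical partitioner for a known, finite key set (A..I as in the question).
    public class FixedKeyPartitioner implements Partitioner {

        // One dedicated partition per known key (illustrative mapping)
        private static final Map<String, Integer> KEY_TO_PARTITION = Map.of(
                "A", 0, "B", 1, "C", 2, "D", 3, "E", 4,
                "F", 5, "G", 6, "H", 7, "I", 8);

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            Integer partition = KEY_TO_PARTITION.get(String.valueOf(key));
            if (partition != null) {
                return partition;
            }
            // Unknown keys fall back to the last partition in this sketch
            return cluster.partitionCountForTopic(topic) - 1;
        }

        @Override
        public void close() {
        }

        @Override
        public void configure(Map<String, ?> configs) {
        }
    }

You would then register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, FixedKeyPartitioner.class.getName());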

pgras