
I am using Kafka Connect version confluentinc/cp-kafka-connect:5.1.1-1 with a Kafka cluster running kafka_2.11-0.11.0.3 (3 brokers).

This Kafka cluster works fine with the old producer/consumer APIs, used via Spark Streaming.

Now I have tried to add Kafka Connect, and I get the following error:

ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition kc-offsets-22 could be determined

I can see that this topic exists. I can even write to and read from this topic using the following commands:

./kafka-console-producer.sh \
    --broker-list `hostname`:9092 \
    --topic kc-offsets \
    --property "parse.key=true"

./kafka-console-consumer.sh --zookeeper $KAFKA_ZOOKEEPER_CONNECT --topic kc-offsets --from-beginning --property print.key=true
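For reference, the partition leaders and ISR for this topic can also be checked with the standard topics tool (same Zookeeper connection as above):

./kafka-topics.sh --zookeeper $KAFKA_ZOOKEEPER_CONNECT --describe --topic kc-offsets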

The Kafka Connect machine can connect to all of my brokers.

But for some reason Kafka Connect will not start.

I would really appreciate any suggestions on how to investigate/solve this.

UPDATE: I have tried changing the replication factor to 1 as suggested here, but it didn't help.
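For reference, the relevant part of my worker configuration (connect-distributed.properties) looks roughly like this; only kc-offsets is the exact topic name from the error above, the other names and values are placeholders:

    bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
    group.id=kc-cluster
    offset.storage.topic=kc-offsets
    offset.storage.replication.factor=1
    config.storage.topic=kc-configs
    config.storage.replication.factor=1
    status.storage.topic=kc-status
    status.storage.replication.factor=1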


2 Answers


The error indicates that some records are put into the queue at a faster rate than they can be sent from the client.

When your producer (Kafka Connect in this case) sends messages, they are first stored in a buffer (before being sent to the target broker), and the records are grouped together into batches in order to increase throughput. When a new record is added to a batch, it must be sent within a configurable time window, which is controlled by request.timeout.ms (the default is 30 seconds). If the batch sits in the queue for longer than that, a TimeoutException is thrown, and the records in the batch are then removed from the queue and won't be delivered to the broker.

Increasing the value of request.timeout.ms should do the trick for you.

In case this does not work, you can also try decreasing batch.size so that batches are sent more often (but each batch will include fewer messages), and make sure that linger.ms is set to 0 (which is the default value).
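As a rough sketch of where these could go for a distributed worker (values are illustrative only; the producer.-prefixed form applies to the producers used by connector tasks, while, depending on the Connect version, the unprefixed form may be picked up by the worker's internal clients):

    # connect-distributed.properties -- illustrative values only
    request.timeout.ms=120000
    producer.request.timeout.ms=120000
    producer.batch.size=8192
    producer.linger.ms=0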

If you still get the error, I assume that something is wrong with your network. Have you enabled SSL?

  • Thank you @Giorgos Myrianthous. 1) I don't understand, though, why Kafka Connect sends a lot of messages before I have even added a connector. 2) How can I set batch.size and linger.ms in Kafka Connect? 3) I am not using SSL. – Ehud Lev Aug 04 '19 at 12:38
  • @EhudLev You can change them in the server.properties file. – Giorgos Myrianthous Aug 04 '19 at 12:39

I found the root cause of this issue: we had a corrupted __consumer_offsets topic. Before Kafka Connect we were using old-style consumers, so we never noticed the problem. The solution for us was to create a new Kafka cluster, and that solved the problem.

BTW, to check that the topic is working, you just need to read from it:

./kafka-console-consumer.sh --zookeeper $KAFKA_ZOOKEEPER_CONNECT --topic __consumer_offsets --from-beginning --property print.key=true
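Note that __consumer_offsets stores binary records, so the plain console consumer mainly confirms that the partitions can be fetched; to actually decode the entries on this Kafka version you would add the offsets message formatter, along these lines (formatter class name assumed for 0.11):

./kafka-console-consumer.sh --zookeeper $KAFKA_ZOOKEEPER_CONNECT --topic __consumer_offsets --from-beginning \
    --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter"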