I am getting some random timeout errors while publishing messages using Confluent.Kafka. The application runs in a Kubernetes cluster and is built using the .NET 6 framework. When the default timeout (60000ms) is reached, the message is successfully published as a result of the configured retries. These are the logs:
[08:03:34 INF] Publishing message with key 'dada33d9-067c-4528-8e85-b9f50858fb1b' and version 'v1' to topic 'topic.name'...
[08:04:34 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 60018ms, timeout #0)
[08:04:34 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[08:04:34 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP)
[08:04:34 INF] A non fatal error with error code 'Local_TimedOut' occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP). The client will automatically try to recover from this error.
[08:04:34 INF] Message with key 'dada33d9-067c-4528-8e85-b9f50858fb1b' and version 'v1' successfully published to topic 'topic.name' in partition '[4]:4765'.
And by changing SocketTimeoutMs
from 60000ms to 5000ms, the message is published after 5 seconds:
[13:45:46 INF] Publishing message with key 'db380009-a910-43e4-8a1b-457e5d163f9a' and version 'v1' to topic 'topic.name'...
[13:45:51 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 5103ms, timeout #0)
[13:45:51 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[13:45:51 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP)
[13:45:51 INF] A non fatal error with error code 'Local_TimedOut' occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP). The client will automatically try to recover from this error.
[13:45:52 INF] Message with key 'db380009-a910-43e4-8a1b-457e5d163f9a' and version 'v1' successfully published to topic 'topic.name' in partition '[7]:14638'.
I can't find a pattern for this behavior, but it happens pretty often I would say, both when the application is deployed and on my local environment.
Based on the logs from App Insights for one of our environments:
- Last 30 Days: 305 timeouts out of 55.72k messages published => 0.55%
- Last 7 Days: 92 timeouts out of 14.60k messages published => 0.62%
And from another environment where we don't have to many messages published:
- Last 30 Days: 63 timeouts out of 467 messages published => 13.49%
- Last 7 Days: 20 timeouts out of 244 messages published => 8.2%
I tried changing the default values for some of the configuration settings, like the default value for LingerMs
from 5ms to 100ms, but didn't help.
Does anyone have any idea what could be the reason behind this? I don't know if there is something wrong with my code/configuration, or something wrong with the Kafka broker. The broker is not owned or maintained by our team, but I know it is built on top of Apache Kafka, using Strimzi, running in Kubernetes.
Client configuration:
- CompressionType: CompressionType.Snappy
- Acks: Acks.Leader
- SecurityProtocol: SecurityProtocol.SaslSsl
- EnableSslCertificateVerification: true
- MessageMaxBytes: 100000000
- SocketKeepaliveEnable: true
- MessageSendMaxRetries: 10
- RetryBackoffMs: 100
- LingerMs: 100
- SocketTimeoutMs: 5000
Edit:
More logs with Debug = "broker,topic,msg"
[12:24:24 INF] Publishing message with key 'cd23c386-0855-4207-9868-c1ef0acf059c' and version 'v2' to topic 'topic.name'...
[12:24:24 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 3016ms, timeout #0)
[12:24:27 INF] rdkafka#producer-1: REQERR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: ProduceRequest failed: Local: Timed out: explicit actions Retry,MsgPossiblyPersisted
[12:24:27 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) encountered error: Local: Timed out (actions Retry,MsgPossiblyPersisted)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP) (_TIMED_OUT)
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP)
[12:24:27 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state UP -> DOWN
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [1]: broker is down: re-query
[12:24:29 INF] A non fatal error with error code 'Local_TimedOut' occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP). The client will automatically try to recover from this error.
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [4]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [7]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state DOWN -> INIT
[12:24:29 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state INIT -> TRY_CONNECT
[12:24:29 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: broker in state TRY_CONNECT connecting
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state TRY_CONNECT -> CONNECT
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 0 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 1 Leader 1
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 2 Leader 2
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 3 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 0 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 1 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 2 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 3 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connecting to ipv4#13.95.89.147:9094 (sasl_ssl) with socket 5896
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected to ipv4#13.95.89.147:9094
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state CONNECT -> SSL_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: CONNECTED - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected (#2)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state APIVERSION_QUERY (handshake supported)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state APIVERSION_QUERY -> AUTH_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: SASLMECHS - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker supported SASL mechanisms: OAUTHBEARER
[12:24:31 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state AUTH_HANDSHAKE (handshake supported)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_HANDSHAKE -> AUTH_REQ
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: OAUTHBEARER - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: SASL OAUTHBEARER authentication successful (principal=)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_REQ -> UP
[12:24:31 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:31 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) delivered
[12:24:31 INF] Message with key 'cd23c386-0855-4207-9868-c1ef0acf059c' and version 'v2' successfully published to topic 'topic.name' in partition '[7]:18182'.