I have a Kafka cluster in Kubernetes created using Strimzi.

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: {{ .Values.cluster.kafka.name }}
spec:
  kafka:
    version: 2.7.0
    replicas: 3
    storage:
      deleteClaim: true
      size: {{ .Values.cluster.kafka.storagesize }}
      type: persistent-claim
    rack: 
      topologyKey: failure-domain.beta.kubernetes.io/zone
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/scrape: 'true'
            prometheus.io/port: '9404'
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: tls
        configuration:
          bootstrap:
            loadBalancerIP: {{ .Values.cluster.kafka.bootstrapipaddress }}
          brokers:
          {{- range (split "," .Values.cluster.kafka.brokersipaddress) }}
            - broker: {{ (split "=" .)._0 }}
              loadBalancerIP: {{ (split "=" .)._1 | quote }}
          {{- end }}
    authorization:
      type: simple
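
For reference, the range loop in the external listener assumes that brokersipaddress is a comma-separated list of broker=IP pairs. A minimal values.yaml sketch (the names and IPs here are made up for illustration, not from my real setup):

cluster:
  kafka:
    name: my-cluster
    storagesize: 100Gi
    bootstrapipaddress: 10.0.0.10    # hypothetical static load balancer IP
    # "<brokerId>=<loadBalancerIP>" pairs, one per broker replica
    brokersipaddress: "0=10.0.0.11,1=10.0.0.12,2=10.0.0.13"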

The cluster is created and up, and I am able to create topics and produce/consume to/from them. The issue is that if I exec into one of the Kafka broker pods, I see intermittent errors:

INFO [SocketServer brokerId=0] Failed authentication with /10.240.0.35 (SSL handshake failed) (org.apache.kafka.common.network.Selector) [data-plane-kafka-network-thread-0-ListenerName(EXTERNAL-9094)-SSL-9]

INFO [SocketServer brokerId=0] Failed authentication with /10.240.0.159 (SSL handshake failed) (org.apache.kafka.common.network.Selector) [data-plane-kafka-network-thread-0-ListenerName(EXTERNAL-9094)-SSL-11]

INFO [SocketServer brokerId=0] Failed authentication with /10.240.0.4 (SSL handshake failed) (org.apache.kafka.common.network.Selector) [data-plane-kafka-network-thread-0-ListenerName(EXTERNAL-9094)-SSL-10]

INFO [SocketServer brokerId=0] Failed authentication with /10.240.0.128 (SSL handshake failed) (org.apache.kafka.common.network.Selector) [data-plane-kafka-network-thread-0-ListenerName(EXTERNAL-9094)-SSL-1]

After inspecting these IPs (10.240.0.35, 10.240.0.159, 10.240.0.4, 10.240.0.128) I figured out that they all belong to pods in the kube-system namespace, which are created implicitly as part of the Kafka cluster deployment.

Any idea what could be wrong?

Inako

1 Answer

I do not think this is necessarily wrong. You seem to have some application somewhere trying to connect to the broker without properly configured TLS. But as the connection is forwarded, the IP probably gets masked, so it does not show the real external IP anymore. This can be all kinds of things, from misconfigured clients up to health checks that just open a TCP connection (depending on your environment, the load balancer can do that, for example).

Unfortunately, it is a bit hard to find out where these connections really come from. You can try to trace a connection through the logs of whatever owns the IP address it came from, since that component forwarded it from someone else, and so on. You could also try to enable TLS debugging in Kafka with the Java system property javax.net.debug=ssl. But that might help only in some cases with misconfigured clients, not with plain TCP probes, and it will also make it hard to find the right place in the logs, because it dumps the replication traffic etc., which uses TLS as well.
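
If you want to try that, a minimal sketch of setting the property through the Strimzi Kafka resource could look like the following, assuming your Strimzi version supports javaSystemProperties under jvmOptions:

spec:
  kafka:
    # Passed to the broker JVM; enables verbose TLS handshake logging
    jvmOptions:
      javaSystemProperties:
        - name: javax.net.debug
          value: ssl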

Jakub
  • Thank you Jakub for your response. To isolate the issue I made sure no apps are running and trying to connect to the Kafka cluster. The IPs that are having SSL issues connecting to Kafka belong to kube-system namespace pods (internal pods that implement cluster features). So even though there are no external apps trying to connect to Kafka, I still observe these errors in the Kafka broker pods. – Inako Apr 08 '21 at 12:37
  • Which exact pods were the IP addresses pointing to? To the `kube-proxy` pods? Or something else? Some of the pods in `kube-system` are responsible for routing the connection to the Kafka pods. Also, as I said ... keep in mind that this does not have to be a Kafka application per se. It could be some health check probe from the load balancer or some other component. It could be some monitoring tool etc. Basically anything that tries to connect to the Kafka SSL listener and does not do the TLS handshake produces an error like this. – Jakub Apr 08 '21 at 14:10
  • Yes, it is mostly the kube-proxy pods. These pods are internal and should not use the external ports, is that correct? If so, why does the error say "data-plane-kafka-network-thread-0-ListenerName(EXTERNAL-9094)-SSL-1"? – Inako Apr 08 '21 at 14:30
  • I'm not sure what you use for the external listener. But my understanding is that, at least in some cases, a connection through a node port or load balancer reaches one of your worker nodes and from there it is passed through `kube-proxy` to the broker pods. This can to some extent be controlled by the external traffic policy (https://strimzi.io/docs/operators/latest/full/using.html#property-listener-config-traffic-policy-reference); see the sketch after these comments. But in general it is expected that `kube-proxy` would be in the middle. And that can mask the original IP address. At least that is my understanding. – Jakub Apr 08 '21 at 15:31
  • Thank you for your input, I will investigate further. – Inako Apr 08 '21 at 16:37
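
As a sketch of the traffic-policy option mentioned in the comments above: setting externalTrafficPolicy: Local on the external listener asks Kubernetes to route external traffic only to nodes that host a broker pod, which also tends to preserve the client source IP. This is an illustration against the Kafka CR from the question, not a confirmed fix:

    listeners:
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
        authentication:
          type: tls
        configuration:
          # Local avoids the extra kube-proxy hop and keeps the client
          # source IP; the default (Cluster) may masquerade it.
          externalTrafficPolicy: Local
          bootstrap:
            loadBalancerIP: {{ .Values.cluster.kafka.bootstrapipaddress }}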