
We have been using Istio for some time, but have recently discovered an issue we can't explain with outlier detection. We have 50+ microservices and have found that on at least 2-3 of them traffic does not seem to be load balancing. We have tracked this down to outlier detection, as once we remove it from the destination rule, load balancing works correctly.

The attached screenshot shows <1% of the traffic going to the pod ending in 8kh2p.
My main issue is that even though we can replicate the issue and resolve it by removing outlier detection, we are seeing no metrics to show that the circuit breaker/outlier detection has been tripped. As per this GitHub issue - https://github.com/istio/istio/issues/8902 - we should be able to track it with something similar to:

sum(istio_requests_total{response_code="503", response_flags="UO"}) by (source_workload, destination_workload, response_code) 

I have also found some Envoy documentation suggesting I should be able to track it with:

envoy_cluster_circuit_breakers_default_cx_open

None of these metrics seem to show anything being triggered.
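
For reference, the raw Envoy counters can also be checked directly on a sidecar with something like the command below (the workload name some-caller is a placeholder; since outlier detection state lives on the client proxy, this has to be run against a pod that calls some-service):

# Dump the raw Envoy stats from a calling pod's sidecar and filter for the
# outlier detection counters (ejections_active, ejections_enforced_total, ...).
# "some-caller" is a placeholder for a workload that calls some-service.
kubectl exec -n some-namespace deploy/some-caller -c istio-proxy -- \
  pilot-agent request GET stats | grep 'some-service.*outlier_detection'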

I do want to point out a similar post on stackoverflow.com, which did not seem to fix our issue.

If anyone could help us figure out why things are not load balancing correctly with outlier detection on, or at least suggest a way we can track that it's being tripped, it would be much appreciated. Our destination rule looks like:

kind: DestinationRule
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: some-service-dr
  namespace: some-namespace
spec:
  host: some-service.some-namespace.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        idleTimeout: 3s
        maxRequestsPerConnection: 1
      tcp:
        maxConnections: 500
    outlierDetection:
      consecutive5xxErrors: 0 # disabled, as our services expect 500s back
      consecutiveGatewayErrors: 5 # 502, 503, 504 should trigger this
      interval: 10s
      maxEjectionPercent: 50
    tls:
      mode: ISTIO_MUTUAL
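
As a sanity check, the cluster configuration that a client sidecar actually received from istiod can be inspected with istioctl (the pod name below is a placeholder); the outbound cluster for some-service should contain the outlierDetection settings above:

# Show the outbound cluster config that a calling pod's sidecar received from
# istiod; it should include the outlierDetection settings from the rule above.
# "some-caller-pod" is a placeholder for any pod that calls some-service.
istioctl proxy-config cluster some-caller-pod -n some-namespace \
  --fqdn some-service.some-namespace.svc.cluster.local -o json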

Our virtual services look like:

kind: VirtualService
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: some-service-vs
  namespace: some-namespace
spec:
  hosts:
    - some-service.some-namespace.svc.cluster.local
  http:
    - retries:
        attempts: 5
        perTryTimeout: 30s
        retryOn: 'connect-failure,refused-stream,reset'
      route:
        - destination:
            host: some-service.some-namespace.svc.cluster.local
            port:
              number: 80
  exportTo:
    - .

Peer Authentication

kind: PeerAuthentication
apiVersion: security.istio.io/v1beta1
metadata:
  name: some-service-tls-policy
  namespace: some-namespace
spec:
  selector:
    matchLabels:
      app: some-service
  mtls:
    mode: STRICT
  portLevelMtls: ~

Kubernetes version v1.21.x

Istio version 1.10.x

Prometheus version 2.28.x

UPDATE

I have updated our destination rule, setting both consecutive5xxErrors and consecutiveGatewayErrors to 0, and the issue still persists: with 2 pods, one pod takes 100% of the traffic and no traffic is load balanced to the other one. New settings below:

outlierDetection:
  interval: 10s
  maxEjectionPercent: 50
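
To quantify the skew, the per-endpoint counters can be pulled from the calling sidecar's Envoy /clusters admin endpoint (again, some-caller is a placeholder for a workload that calls some-service); rq_total per endpoint shows the real distribution, and an ejected endpoint is marked with a failed_outlier_check health flag:

# Per-endpoint stats from the calling sidecar's Envoy /clusters admin endpoint:
# rq_total shows how many requests each pod IP actually received, and an
# ejected endpoint is marked with the failed_outlier_check health flag.
# "some-caller" is a placeholder for a workload that calls some-service.
kubectl exec -n some-namespace deploy/some-caller -c istio-proxy -- \
  pilot-agent request GET clusters | grep 'some-service'
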
Comments

  • As per documentation: "if the value of consecutivegatewayerrors is greater than or equal to the value of consecutive5xxerrors, consecutivegatewayerrors will have no effect" - this may be the culprit. Are you sure your metrics are correct? Are you able to force outlier detection to eject a pod in a controlled environment and check the metrics again? –  Oct 20 '21 at 12:05
  • Per documentation: "Note that consecutivegatewayerrors and consecutive5xxerrors can be used separately or together. consecutive5xxerrors: This feature defaults to 5 but can be disabled by setting the value to 0," which is what we have done. We do not have an app to test this with, as we have been turning the 5xx errors off and we purposely do not return 502, 503 or 504 from our application. I can try setting both of them to 0 so both are disabled to see if the issue still happens, but that kind of defeats the point of why we want them – Asuu Oct 20 '21 at 13:14
  • Yeah, setting both to 0 is not a solution. How did you install Prometheus and Kiali? Did you use the default configuration? Which profile did you use while installing Istio? –  Oct 20 '21 at 13:21
  • Prometheus is installed with the kube-prometheus-stack community Helm charts and Kiali via the kiali-operator charts. All other relevant Istio metrics seem to be working. We frequently use the istio_requests_total metric to alert when we have large amounts of 503s and 504s coming in, so I know the metric works. Based on what I've found, the response_flag UO should signify a circuit break - we can track requests with the other flags (for example UF, URX, etc.) with no issues – Asuu Oct 20 '21 at 13:25
  • I will note that I updated the outlier detection to set both consecutive5xxErrors and consecutiveGatewayErrors to 0 and the issue still persists - with 2 pods, one pod gets all the traffic and the other has received 0 traffic – Asuu Oct 20 '21 at 14:04
