We have been using Istio for some time, but recently discovered an issue we can't explain with outlier detection. Across our 50+ microservices, at least 2-3 of them do not appear to be load balancing traffic. We have tracked this down to outlier detection: once we remove it from the DestinationRule, load balancing works correctly.
The screenshot (omitted here) shows <1% of the traffic going to the pod ending in 8kh2p.
My main issue is that even though we can replicate the problem and resolve it by removing outlier detection, we see no metrics showing that the circuit breaker/outlier detection has been tripped. As per this GitHub issue - https://github.com/istio/istio/issues/8902 - we should be able to track it with something similar to:
sum(istio_requests_total{response_code="503", response_flags="UO"}) by (source_workload, destination_workload, response_code)
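As far as I can tell, the UO response flag marks circuit-breaker overflow rather than an outlier ejection as such, so a variant of the query without the response-code filter should catch any UO flags regardless of status code:

sum(istio_requests_total{response_flags="UO"}) by (source_workload, destination_workload, response_code)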
I have also found Envoy documentation suggesting I should be able to track it with:
envoy_cluster_circuit_breakers_default_cx_open
None of these metrics show anything being triggered.
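For reference, the raw Envoy counters can also be read from a client sidecar without going through Prometheus; Envoy keeps per-cluster outlier_detection.* counters (ejections_active, ejections_enforced_total, etc.). Something like this, with the pod name as a placeholder, should list them:

kubectl exec -n some-namespace <client-pod> -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection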
I also want to point out a similar post on stackoverflow.com whose suggestions did not fix our issue.
If anyone could help figure out why traffic is not load balancing correctly with outlier detection enabled, or at least suggest a way to track that it is being tripped, it would be much appreciated. Our DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: some-service-dr
  namespace: some-namespace
spec:
  host: some-service.some-namespace.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        idleTimeout: 3s
        maxRequestsPerConnection: 1
      tcp:
        maxConnections: 500
    outlierDetection:
      consecutive5xxErrors: 0      # disabling, as our services expect 500s back
      consecutiveGatewayErrors: 5  # 502, 503, 504 should trigger this
      interval: 10s
      maxEjectionPercent: 50
    tls:
      mode: ISTIO_MUTUAL
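For reference, the outlier settings that a client sidecar actually received can be dumped with something like the following (pod name is a placeholder), to rule out a config-push problem:

istioctl proxy-config cluster <client-pod> -n some-namespace \
  --fqdn some-service.some-namespace.svc.cluster.local -o json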
Our virtual services look like:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: some-service-vs
  namespace: some-namespace
spec:
  hosts:
  - some-service.some-namespace.svc.cluster.local
  http:
  - retries:
      attempts: 5
      perTryTimeout: 30s
      retryOn: 'connect-failure,refused-stream,reset'
    route:
    - destination:
        host: some-service.some-namespace.svc.cluster.local
        port:
          number: 80
  exportTo:
  - .
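istioctl can also show the per-endpoint view of this cluster from a client sidecar, including an OUTLIER CHECK column that should say whether the 8kh2p pod is considered ejected; something like this (pod name is a placeholder):

istioctl proxy-config endpoint <client-pod> -n some-namespace \
  --cluster "outbound|80||some-service.some-namespace.svc.cluster.local"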
Our PeerAuthentication policy:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: some-service-tls-policy
  namespace: some-namespace
spec:
  selector:
    matchLabels:
      app: some-service
  mtls:
    mode: STRICT
  portLevelMtls: ~
Kubernetes version v1.21.x
Istio version 1.10.x
Prometheus version 2.28.x
UPDATE
I have updated our DestinationRule, setting both consecutive5xxErrors and consecutiveGatewayErrors to 0, and the issue persists: with 2 pods, one pod takes 100% of the traffic and nothing is load balanced to the other. New settings below:
outlierDetection:
  interval: 10s
  maxEjectionPercent: 50
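One more thing worth mentioning in case it matters for the missing metrics: if I read the Istio docs correctly, the sidecar only exposes a subset of Envoy stats to Prometheus by default, and the outlier_detection.* counters are not in that subset. My understanding is that a pod annotation along these lines (per the proxyStatsMatcher docs; this is an assumption for our versions, not something we have verified) would be needed before the ejection stats show up at all:

metadata:
  annotations:
    proxy.istio.io/config: |-
      proxyStatsMatcher:
        inclusionRegexps:
        - ".*outlier_detection.*"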